Categorical and Limited Dependent Variable Modeling in Higher Education

Higher Education: Handbook of Theory and Research

Abstract

Higher education researchers have applied increasingly sophisticated regression techniques to the study of many important issues. Historically, the statistical workhorse of this work has been linear regression, which has several desirable properties for analyzing continuous outcomes, and under certain assumptions yields unbiased coefficient estimates. However, several outcomes of interest to higher education scholars are categorical or limited in the values they assume, and using linear regression to study them may violate important assumptions. Herein we provide an overview of regression techniques often employed when studying categorical or limited dependent variables. We begin by discussing the modeling of binary outcomes, which are often studied using linear probability, logistic, or probit models. We then consider dependent variables with multiple categories, modeled using ordinal and multinomial regression methods. We also discuss the use of models for other limited dependent variables, including counts, fractions, and censored or truncated outcomes. Throughout the chapter, we apply these techniques to the study of students’ college choice using a relatively new data set available from the National Center for Education Statistics.

Notes

  1.

    Although we utilized a restricted version of the HSLS data, there is also a publicly available version (see https://nces.ed.gov/edat/).

  2.

    Defined as two- or four-year college enrollment as of November of 2013.

  3.

    For studies with few observations (e.g., fewer than 100), consider exact logistic regression (Mehta & Patel, 1995).

  4.

    Our students were nested within high schools, which might suggest adjusting standard errors for the heterogeneity found within high schools by using Stata's vce(cluster) option. However, there is a tradeoff here. As Long and Freese (2014) discuss, once robust standard errors are used, maximum likelihood is no longer an appropriate estimator. After comparing our model with and without school-level clustered errors, we found little difference in our findings and decided to proceed without the robust errors. Models that include robust standard errors should rely on the Wald test rather than the likelihood ratio test (Sribney, n.d.).

  5.

    −2*[(−6988)−(−5533)] = 2910.

  6.

    If using survey data, Archer and Lemeshow (2006) argue one should account for survey sampling design to calculate goodness-of-fit using the Stata command svylogitgof.

  7.

    Stata automatically reports odds ratios instead of raw coefficients when the logistic command is used; alternatively, one can obtain odds ratios by adding the or option to the logit command.

  8.

    A logistic regression model with college enrollment as the outcome and gender as the only covariate will confirm that the odds ratio is indeed 1.41 (p < 0.001).

  9.

    In other words, there is a built-in nonlinearity to the relationship between each covariate and the outcome. However, even with this nonlinearity imposed by the functional form, researchers still need to consider whether any higher-order terms (i.e., polynomials) of covariates are appropriate to account for nonlinear relationships in the logit (or log-odds).

  10.

    Computed by adding the atmeans option when using the Stata margins command.

  11.

    A formal statistical test can be applied to test the difference between two probabilities using mtable and mlincom. For more, see Long and Freese (2014).

  12.

    The default classification threshold in Stata is a probability of 0.5: observations with predicted probabilities of 0.5 or higher are classified as 1, and all others as 0.

  13.

    To be clear, there is no absolute definition of an area under the curve value that indicates a “good fit,” only rule-of-thumb ranges: 0.5 indicates no discrimination (no better than chance); 0.5 to 0.7 is considered poor; 0.7 to 0.8 is acceptable; 0.8 to 0.9 is excellent; and greater than 0.9 is outstanding (Hosmer et al., 2013).

  14.

    An additional approach that is not discussed here but may be of use to higher education researchers is the sequential logit, which models events that individuals experience in sequence, for example course-taking (Algebra I, Algebra II, Pre-Calculus), admission stages (application, admission, enrollment), or tenure-track faculty positions (Assistant, Associate, Full).

  15.

    For a graphical representation of the cut points, see Long (1997).

  16.

    See Long (1997) for the derivation of the parallel regression assumption.

  17.

    The generalized ordered logit model does not assume that the \( \widehat{\beta} \)’s are equal. See Long and Freese (2014).

  18.

    For brevity, we do not include hypothesis tests of the ordered logit or probit, but refer readers to Long and Freese’s (2014) overview.

  19.

    For more on generalized ordered logit models, see Long and Freese (2014).

  20.

    Marginal effects are useful here in comparing across models.

  21.

    This is why this model is often called the “cumulative logit model.”

  22.

    Researchers should be careful to distinguish between the risk ratio and odds ratio, as they are not interchangeable terms. In particular, odds ratios and risk ratios are most dissimilar in the middle of a distribution (Menard, 2010). Only when J = 2 are the relative risk ratio and odds ratio equal. For a clear explanation, see https://www.stata.com/statalist/archive/2005-04/msg00678.html

  23.

    The IIA also applies to the conditional logit (not discussed here).

  24.

    For an explanation of odds and risk ratios/relative risks see: http://www.theanalysisfactor.com/the-difference-between-relative-risk-and-odds-ratios/

  25.

    Only the Wald test works when using robust standard errors or survey commands. See Long and Freese (2014) for a discussion of tradeoffs between the Wald and likelihood ratio tests.

  26.

    Note these are increases in probability, rather than odds.

  27.

    If overdispersion is the result of excess zeros, a zero-inflated Poisson model may be preferable to negative binomial regression (Long, 1997).

  28.

    Recall that if we were concerned about the 14% of students who do not apply to college (and thus have a count of zero), or if we wanted to understand the decision not to apply to college separately, we could estimate a zero-inflated count model.

  29.

    This may seem like a large number of goodness of fit tests to run, but in Stata the user-written command countfit provides all of these results simultaneously.

  30.

    One could also use heteroskedastic probit to model the variance rather than the mean of a proportional outcome.

  31.

    Tobit models also work for censoring from above, such as when data from surveys top-code variables like income for privacy reasons.
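
Note 5 combines two log-likelihoods. As a minimal sketch in Python (not the chapter's Stata), and assuming that −6988 belongs to the unconditional model, that −5533 belongs to the full model, and that the quantity of interest is the likelihood ratio statistic, the arithmetic can be checked as follows. The function name is hypothetical.

```python
def likelihood_ratio_statistic(ll_restricted: float, ll_full: float) -> float:
    """Return -2 * (LL_restricted - LL_full), which is non-negative for nested models."""
    return -2.0 * (ll_restricted - ll_full)


# Log-likelihoods quoted in note 5; which model each belongs to is an assumption.
LL_UNCONDITIONAL = -6988.0
LL_FULL = -5533.0

lr_stat = likelihood_ratio_statistic(LL_UNCONDITIONAL, LL_FULL)
print(f"LR statistic: {lr_stat:.0f}")
```

The statistic would then be compared against a chi-squared distribution with degrees of freedom equal to the number of covariates added, which is not stated in the notes.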
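
Notes 22 and 24 caution that odds ratios and risk ratios are not interchangeable. The following Python sketch illustrates the point with two pairs of hypothetical probabilities (they are not estimates from the chapter): both pairs have a risk ratio of 2, but the odds ratio drifts away from 2 as the probabilities move toward the middle of the distribution.

```python
def odds(p: float) -> float:
    """Convert a probability into odds."""
    return p / (1.0 - p)


def risk_ratio(p_group: float, p_reference: float) -> float:
    """Ratio of two probabilities."""
    return p_group / p_reference


def odds_ratio(p_group: float, p_reference: float) -> float:
    """Ratio of two odds."""
    return odds(p_group) / odds(p_reference)


# Hypothetical probability pairs: a rare outcome and a common outcome.
for p_group, p_reference in [(0.02, 0.01), (0.60, 0.30)]:
    rr = risk_ratio(p_group, p_reference)
    oratio = odds_ratio(p_group, p_reference)
    print(f"p={p_group:.2f} vs {p_reference:.2f}: risk ratio={rr:.2f}, odds ratio={oratio:.2f}")
```

With the rare outcome the two measures nearly coincide; with the common outcome the odds ratio is 3.5 while the risk ratio stays at 2.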
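
Notes 12 and 13 describe the 0.5 classification threshold and rule-of-thumb ranges for the area under the ROC curve. The sketch below, in Python with hypothetical outcomes and predicted probabilities, classifies at the threshold, computes sensitivity and specificity, and computes the area under the curve by pairwise comparison. Two assumptions are made: the cutoff is applied as "greater than or equal to" (the rule Stata's classification table reports), and a value that falls exactly on a boundary between two ranges is assigned to the higher category.

```python
from typing import List, Tuple


def classify(probabilities: List[float], threshold: float = 0.5) -> List[int]:
    """Classify as 1 when the predicted probability reaches the threshold."""
    return [1 if p >= threshold else 0 for p in probabilities]


def sensitivity_specificity(outcomes: List[int], predicted: List[int]) -> Tuple[float, float]:
    """Share of actual 1s classified as 1, and share of actual 0s classified as 0."""
    tp = sum(1 for y, c in zip(outcomes, predicted) if y == 1 and c == 1)
    fn = sum(1 for y, c in zip(outcomes, predicted) if y == 1 and c == 0)
    tn = sum(1 for y, c in zip(outcomes, predicted) if y == 0 and c == 0)
    fp = sum(1 for y, c in zip(outcomes, predicted) if y == 0 and c == 1)
    return tp / (tp + fn), tn / (tn + fp)


def area_under_roc(outcomes: List[int], probabilities: List[float]) -> float:
    """Probability that a random 1 outranks a random 0 (ties count one half)."""
    pos = [p for y, p in zip(outcomes, probabilities) if y == 1]
    neg = [p for y, p in zip(outcomes, probabilities) if y == 0]
    wins = sum(1.0 if a > b else 0.5 if a == b else 0.0 for a in pos for b in neg)
    return wins / (len(pos) * len(neg))


def describe_auc(auc: float) -> str:
    """Rule-of-thumb labels from note 13; boundary handling is an assumption."""
    if auc > 0.9:
        return "outstanding"
    if auc >= 0.8:
        return "excellent"
    if auc >= 0.7:
        return "acceptable"
    if auc > 0.5:
        return "poor"
    return "no discrimination"


# Hypothetical data: 1 = enrolled, 0 = did not enroll.
outcomes = [1, 1, 1, 0, 0, 1, 0, 0]
probabilities = [0.90, 0.80, 0.60, 0.70, 0.40, 0.55, 0.30, 0.45]

predicted = classify(probabilities)
sens, spec = sensitivity_specificity(outcomes, predicted)
auc = area_under_roc(outcomes, probabilities)
print(f"sensitivity={sens:.2f}, specificity={spec:.2f}, AUC={auc:.3f} ({describe_auc(auc)})")
```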
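
Notes 27 and 28 link the choice among count models to overdispersion and excess zeros. A quick descriptive check, sketched here in Python on a hypothetical vector of application counts, compares the sample variance with the mean and the observed share of zeros with the share a Poisson distribution with the same mean would imply. It is only a first look and does not replace the formal fit comparisons mentioned in note 29.

```python
from math import exp
from statistics import mean, variance

# Hypothetical number of college applications per student.
applications = [0, 0, 1, 2, 3, 3, 4, 5, 8, 14]

avg = mean(applications)
var = variance(applications)  # sample variance
dispersion_ratio = var / avg  # values well above 1 suggest overdispersion

observed_zero_share = applications.count(0) / len(applications)
poisson_zero_share = exp(-avg)  # Pr(Y = 0) under a Poisson with this mean

print(f"mean={avg:.2f}, variance={var:.2f}, ratio={dispersion_ratio:.2f}")
print(f"zeros observed={observed_zero_share:.3f}, zeros under Poisson={poisson_zero_share:.3f}")
```

Here the variance is several times the mean and zeros are far more common than a Poisson model would predict, the pattern under which the notes suggest considering negative binomial or zero-inflated models.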

Author information

Corresponding author

Correspondence to Awilda Rodriguez.

Appendix

/***********************************************************************************************
These are examples of commands used to estimate the models in the chapter.
The full code is not contained here for space constraints.
***********************************************************************************************/

**Set directories, open data, start log as needed.

*set macro vars (the covariate list is left blank in the source)
global iv " "

*enrl_college is the outcome variable we created.

**** Goodness of Fit ******
* unconditional model
logit enrl_college, or

*full model
logit enrl_college $iv
estimates store loges
predict loges, pr

*describing the pred probs
predict pprob5
set scheme s2mono
histogram pprob5, title("", color(black) margin(zero) size(small)) ///
    xti("Predicted probability", size(small)) graphregion(color(white)) ///
    plotregion(color(white)) yti("Density", size(small))
summarize pprob5

*examining LR
fitstat

*examining classification
estat classification
lsens, title("", color(black) margin(zero) size(small)) ///
    graphregion(color(white)) plotregion(color(white)) xti(,size(small)) yti(,size(small))
lroc, title("", color(black) margin(zero) size(small)) ///
    graphregion(color(white)) plotregion(color(white)) xti(,size(small)) yti(,size(small))

/******LOGIT*****/
estimates restore loges
margins, dydx(*) post
estimates store loges_me

*graphing
estimates restore loges
margins, dydx(gpa) asobserved at(gpa=(1 (.25) 4))
set scheme s2mono
marginsplot, recastci(rarea) recast(line) ciopts(color(*.7)) ///
    graphregion(color(white)) plotregion(color(white)) ti("") ///
    yti("Change in Pr(Enroll)", size(small)) xti("GPA, 10th grade", size(small))

*look at a few populations of interest
mtable, rowname(1 Female first-gen low-inc ) ci clear ///
    at(student_gender==2 parental_ed==1 family_income==(1 2) ) atmeans

/******PROBIT*****/
probit enrl_college $iv
estimates store probes
predict probes, pr
margins, dydx(*) post
estimates store probes_me

/******LPM *****/
regress enrl_college i.student_gender $dems $acad $expct $netwk $sch
estimates store lpm
predict lpm, xb

*diagnostic of lpm
histogram lpm
set scheme s2mono
histogram lpm, title("", color(black) margin(zero) size(small)) ///
    xti("Predicted probability", size(small)) graphregion(color(white)) ///
    plotregion(color(white)) yti("Density", size(small)) ///
    xline(0 1, lstyle(foreground) lpattern("--"))

*plot residual v fitted
set scheme s2mono
rvfplot, yline(0, lstyle(foreground) lpattern("--")) graphregion(color(white)) ///
    plotregion(color(white)) xline(0 1, lstyle(foreground) lpattern("--")) ///
    xti(, size(small)) yti(, size(small))

*check for heteroskedasticity
estat imtest

************************ORDINAL/MULTINOMIAL***********************
*pse_enroll_sel is the dependent var we created.
ologit pse_enroll_sel i.student_gender $iv, or
estimates store ord

*get some marginal effects
estimates restore ord
margins, dydx(gpa) post
estimates store ord_me

*restore the ologit results (margins, post replaced them) before predict and the tests
estimates restore ord
predict nocol_log lsel_log sel_log msel_log

*test if we need multinomial
oparallel, ic
brant, detail

***run it as multinomial
mlogit pse_enroll_sel $iv, rrr
estimates store multi

*get a marginal effect
margins, dydx(gpa) post
estimates store multi_me

*tests of IVs
estimates restore multi
mlogtest, lr
estimates restore multi
mlogtest, wald

*Test of categories - can we collapse them?
mlogtest, combine
estimates restore multi
mlogtest, lrcomb
estimates restore multi
mlogtest, hausman

*Interpretation
estimates restore multi
listcoef student_gender student_race_combo stugpa_10 stu_mathirt apcred, gt adjacent

*Pred Probs for select subgroups
estimates restore multi
mtable if student_gender==2 & parental_ed==1 & family_income==1, ///
    atmeans noci rowname(lowinc firstg) clear brief

************************COUNT************************
poisson apps $iv, irr
estimates store pois
estat ic
prcounts pois, max(20) plot
label var poispreq "Poisson"
label var poisobeq "Observed"
label var poisval "# of apps"

nbreg apps $iv, irr
estimates store nb
estat ic

countfit apps $iv, nbreg prm
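
The appendix requests marginal effects with margins, dydx(...), and note 10 mentions the atmeans option. As a rough illustration of how the two summaries differ, the Python sketch below computes both for a one-covariate logit. The intercept, the GPA coefficient, and the GPA values are all hypothetical and are not estimates from the chapter.

```python
from math import exp


def logistic(xb: float) -> float:
    """Inverse logit: converts the linear index into a probability."""
    return 1.0 / (1.0 + exp(-xb))


# Hypothetical logit coefficients and data (not taken from the chapter).
INTERCEPT = -2.0
GPA_COEF = 1.0
gpa_values = [1.0, 2.0, 3.0, 4.0]


def marginal_effect(gpa: float) -> float:
    """dPr(y=1)/dGPA for a logit: beta * p * (1 - p), evaluated at a given GPA."""
    p = logistic(INTERCEPT + GPA_COEF * gpa)
    return GPA_COEF * p * (1.0 - p)


# Average marginal effect: evaluate at each observation, then average.
ame = sum(marginal_effect(g) for g in gpa_values) / len(gpa_values)

# Marginal effect at the mean: evaluate once at the average GPA.
mem = marginal_effect(sum(gpa_values) / len(gpa_values))

print(f"average marginal effect: {ame:.4f}")
print(f"marginal effect at the mean: {mem:.4f}")
```

Because the logit curve is steepest near a probability of 0.5, the two summaries can differ noticeably when the mean covariate value sits in that region, as it does in this example.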

Copyright information

© 2018 Springer International Publishing AG, part of Springer Nature

About this chapter

Cite this chapter

Rodriguez, A., Furquim, F., DesJardins, S.L. (2018). Categorical and Limited Dependent Variable Modeling in Higher Education. In: Paulsen, M. (eds) Higher Education: Handbook of Theory and Research. Higher Education: Handbook of Theory and Research, vol 33. Springer, Cham. https://doi.org/10.1007/978-3-319-72490-4_7

  • DOI: https://doi.org/10.1007/978-3-319-72490-4_7

  • Publisher Name: Springer, Cham

  • Print ISBN: 978-3-319-72489-8

  • Online ISBN: 978-3-319-72490-4

  • eBook Packages: Education, Education (R0)
