
Using Propensity Scores

A chapter in Multilevel Modeling of Social Problems

Abstract

This chapter examines how both Co-nect’s comprehensive school reforms and reduced student-to-teacher ratios engendered change in the schools’ proportions of students passing tests of reading, mathematics, and fourth grade writing. Using propensity scores that aim to minimize any bias due to the selection of the target or comparison schools, a difference-in-differences (DID) design compares the change on test outcomes in seven elementary schools that participated in the reforms (the target schools) in the time periods before and after the implementation of the reforms, to the change on those outcomes in six comparable elementary schools (the comparison schools) during those same time periods. Several of the response variables are measured on changing cohorts (i.e., quasi-cohorts) of third through fifth graders in Houston, Texas. Even though the mobility of the students into and out of the schools no doubt changed the original composition of the cohorts, henceforth this chapter refers to these quasi-cohorts simply as cohorts. The tests are components of the Texas Assessment of Academic Skills, hereafter referred to as TAAS. In the spring of each school year (SY) the teachers administered these tests in the English language to all capable students. Although this study probes the effects of the reforms after 1 year, from third to fourth grade, it also studies change from fourth to fifth grade and overall change from third to fifth grade; it often refers to these periods as Time 0, Time 1, and Time 2. Relative to change in the comparison schools, the cohorts of students in target schools generally improved their percent passing from third to fourth grade, but by fifth grade the cohorts in both groups of schools were performing about equally, near the ceiling of 100% passing. This chapter attributes this favorable change in the comparison cohorts to the extra teachers assigned to previously low-performing schools to prepare their students for the fifth grade tests.1


Notes

  1.

    Cindy Martin, a Co-nect consultant to the Houston target schools, mentioned these extra teachers in a note (8/19/2003):

    During the 2001–2002 school year North Central District in HISD [Houston Independent School District] concentrated significant district personnel resources to assist schools who were having difficulty in raising their test scores. Additional support was given mostly by North Central supervisors but other available personnel were also used. Assisting these schools became the primary duty of the supervisors assigned [to the] schools with weak test scores. Support personnel worked with teachers and students by demonstrating test taking skills, providing instructional support through modeling lessons, tutoring, working with school leadership in action planning, etc.

    The reductions in the student-to-teacher ratios corroborate Martin’s observations. Since the time of her note, the North Central District has been eliminated; Reagan is now located in the Central District. The other districts are North, South, East, and West.

  2.

    The website circa July 2009 for the Reagan High School presents an optimistic image; the school is undergoing comprehensive reform and has a new focus on the teaching of computer programming.

  3.

    In 2002 the Houston Independent School District (HISD) awarded bonuses to school personnel for improvements in student achievement as measured by standardized tests. Some teachers’ aides and janitors received $200, whereas the school superintendent received $25,000, four subdistrict superintendents received $20,000, and other administrators and teachers received awards of $2,500 to $15,000. Because these incentive payments were based on the students’ test scores, a number of high schools cheated: the schools’ administrators arranged that low-performing students not take the definitive tenth grade test. Some students were held back a grade, others were erroneously classified as requiring special education or as having limited proficiency in English, and still others were encouraged to drop out; at that time dropout rates were not calculated stringently. The net result was a pruning of the pool of potential test-takers so that only the relatively good students would take the tests. For further details see the investigative reporting by Peabody (2003) and Schemo (2003, July 11).

  4.

    SFA (Success for All) aims to improve achievement outcomes in reading, writing, mathematics, science, and social studies, as well as to improve student attendance, promotion, and discipline. Although it is generally thought to be one of the most successful CSR programs, this chapter reports some negative effects.

  5.

    Information provided by Co-nect consultants in Houston classified the schools as implementing Success for All or not. This information was based on their personal knowledge and telephone conversations with school administrators.

  6.

    In their summary of research findings indicating favorable impacts of smaller class size on learning, Murphy and Rosenberg (1998, June 1) state:

    Pupil–teacher ratio and class size technically are not the same thing. Pupil–teacher ratio refers to the number of students divided by the number of staff classified as teachers. Class size refers to the actual number of students in a classroom with a teacher. Pupil–teacher ratio is typically lower than average class size but often used as an approximate measure.

    This chapter uses the student-to-teacher ratio as an approximate measure of class size.

  7.

    The research literature on class size is vast and equivocal: The comprehensive meta-analyses of Glass and Smith (1978, 1979) found that smaller classes lead not only to higher quality schooling and achievement, but also to more positive attitudes. Reporting findings from the Project STAR (Student Teacher Achievement Ratio) studies, conducted in Tennessee from 1985 to 1989, Wood et al. (1990), Mosteller (1995), and Finn and Achilles (1999) find that smaller class size is especially beneficial for minority children from economically disadvantaged families. Jepsen and Rivkin (2009) underscore the trade-off between teacher quality and class size. Hoxby (2000) finds that class size does not have a statistically significant effect on the achievement of students in each grade of 649 elementary schools; the author rules out even modest effect sizes of 2–4% of a standard deviation in achievement scores for a 10% reduction in class size. Imbens (2009, 6–9) comments on Hoxby’s study and others.

  8.

    Co-nect’s benchmarking scores are derived from surveys of the school’s faculty and can be used to assess implementation quality. There are five content areas: community accountability (e.g., high expectations, beneficent school climate, shared school improvement plan, community review process, family and community engagement); high-quality teaching and learning (e.g., projects and project-based learning, thoughtful discourse, coherent conceptual curriculum, reading and writing, and quantitative reasoning across the curriculum); comprehensive assessment (e.g., regular classroom assessment, multiple outcome measures, use of school-level data, community reporting system); team-based school organization (e.g., participatory instructional leadership, cohesive school community, appropriate grouping, supportive and flexible schedule, collaborative school community); and sensible use of technology (e.g., shared vision, access, skillful use in teaching, quality technology training and support, continuous assessment of technology effectiveness). The scores for each of the five dimensions are averaged, leading to an overall percentage score for the school. For 2001 and 2002, respectively, the scores expressed as proportions for these Co-nect schools are: Browning (0.492, 0.426); Crockett (0.418, 0.369); Helms (0.392, 0.370); Love (0.354, 0.477); Milam (0.456, 0.583); Stevenson (0.423, 0.423); and Travis (0, 0). From these proportions implementation scores can be derived by setting all observations on the schools to 1, and then adding these proportions to the values of Co-nect schools in the appropriate time periods. These implementation scores could be used as weights in the mixed modeling, but they do not closely correspond to qualitative observations that implementation of Co-nect reforms was strongest in Helms, Browning, and Love.

    For the data of this evaluation the unweighted estimates are preferred because there is no information about the Travis school and no information about the implementation of Success for All (SFA). Moreover, the data are not weighted by the number of students in the school, a more obvious weight. Such weightings have very little effect on the results.
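    The weighting scheme just described is easy to sketch in code. The fragment below is an illustrative Python re-computation (the chapter's analysis is in SAS, and the function name here is invented for exposition): every school gets a base score of 1, and the benchmarking proportion for the relevant year is added for the Co-nect schools.

```python
# Illustrative re-computation of the implementation scores described above.
# The chapter's analysis is in SAS; this Python sketch is for exposition only.
benchmark = {  # (2001, 2002) benchmarking proportions for the Co-nect schools
    "Browning": (0.492, 0.426), "Crockett": (0.418, 0.369),
    "Helms": (0.392, 0.370), "Love": (0.354, 0.477),
    "Milam": (0.456, 0.583), "Stevenson": (0.423, 0.423), "Travis": (0.0, 0.0),
}

def implementation_score(school, year):
    """Base value of 1, plus the school's benchmarking proportion for the year."""
    score = 1.0
    if school in benchmark:
        score += benchmark[school][0 if year == 2001 else 1]
    return score

print(round(implementation_score("Browning", 2001), 3))     # 1.492
print(round(implementation_score("Comparison A", 2002), 3))  # 1.0
```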

  9.

    Other comparison groups were not added to the design because it was not known which school reforms, if any, were present in the schools that were candidates for these groups.

  10.

    The author will provide a proof of these relationships upon request.

  11.

    Wooldridge (2002, 128–132; 2006, 454–459) explicates the difference-in-differences estimator for pooled cross sections: that is, treatment versus control before and after an intervention.
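    Wooldridge's estimator can be illustrated with a short simulation. The Python sketch below uses hypothetical data (all numbers invented, not the chapter's): the DID effect is the coefficient on the treatment-by-post interaction, and it coincides with the difference of the four cell-mean differences.

```python
import numpy as np

# Sketch of the difference-in-differences estimator for pooled cross sections:
# y = b0 + b1*treat + b2*post + b3*(treat*post) + e, where b3 is the DID effect.
rng = np.random.default_rng(0)
n = 400
treat = rng.integers(0, 2, n)   # target (1) vs comparison (0) school
post = rng.integers(0, 2, n)    # after (1) vs before (0) the intervention
y = 50 + 5 * treat + 10 * post + 4 * treat * post + rng.normal(0, 1, n)

X = np.column_stack([np.ones(n), treat, post, treat * post])
beta, *_ = np.linalg.lstsq(X, y, rcond=None)
did = beta[3]                   # should recover roughly the true value of 4

# Equivalent "difference of differences" of the four cell means:
m = lambda t, p: y[(treat == t) & (post == p)].mean()
did_means = (m(1, 1) - m(1, 0)) - (m(0, 1) - m(0, 0))
print(round(did, 3), round(did_means, 3))
```

    Because the 2x2 model is saturated, the regression coefficient and the difference of cell means agree exactly.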

  12.

    The FIRTH option on the MODEL statement of Proc Logistic will provide these estimates.
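    To sketch what the Firth correction does, the following NumPy re-implementation (illustrative only, not SAS's code) maximizes Firth's (1993) penalized likelihood by Newton iteration on the adjusted score; unlike ordinary maximum likelihood, it returns finite estimates even under complete separation.

```python
import numpy as np

# Firth's (1993) bias-reduced logistic regression: the adjusted score is
# U*(b) = X'(y - p + h*(0.5 - p)), where h is the diagonal of the hat matrix.
def firth_logistic(X, y, n_iter=50):
    beta = np.zeros(X.shape[1])
    for _ in range(n_iter):
        p = 1.0 / (1.0 + np.exp(-X @ beta))
        W = p * (1.0 - p)
        info = X.T @ (W[:, None] * X)             # Fisher information
        h = W * np.einsum("ij,jk,ik->i", X, np.linalg.inv(info), X)
        score = X.T @ (y - p + h * (0.5 - p))     # Firth-adjusted score
        beta += np.linalg.solve(info, score)      # Newton update
    return beta

# Completely separated data: ordinary ML estimates diverge, Firth's do not.
x = np.array([-3.0, -2.0, -1.0, 1.0, 2.0, 3.0])
X = np.column_stack([np.ones_like(x), x])
y = np.array([0.0, 0.0, 0.0, 1.0, 1.0, 1.0])
beta = firth_logistic(X, y)
print(beta)  # finite intercept and slope despite the separation
```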

  13.

    Project Grad aims to improve educational outcomes in a feeder system through a combination of curriculum change, persuading students to achieve, and financial incentives that offset some of the costs of college for those who graduate.

  14.

    Imbens and Wooldridge (2009, 33) note that: “individuals with propensity scores of 0.45 and 0.50 are likely to be much more similar than individuals with propensity scores equal to 0.01 and 0.06.” The propensity scores in this analysis are not extreme and this supports their use in a regression model.

  15.

    Given the decomposition rules of path-analytic models, the zero-order effect of the treatment equals the conditioned effect of the treatment plus the spurious component due to antecedent variables like the propensity scores and SFA. When the antecedent variables are positively related to both assignment to the treatment and the outcome, the conditioned effect of the treatment equals the zero-order effect minus the spurious component; the conditioned effect of the treatment will be smaller than the zero-order effect. In this analysis SFA and the propensity scores both have negative effects on the outcomes and positive effects on assignment to the treatment. Thus, the spurious component has a negative sign, and the conditioned effect equals the zero-order effect minus that negative component; the conditioned effect is larger than the zero-order effect by the amount of spuriousness. The classic literature refers to this as “suppression” of the zero-order effect by the absence of controls for the other variables.
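    This suppression pattern can be reproduced numerically. The simulation below (hypothetical data, not the chapter's) builds a confounder that is positively related to treatment assignment but negatively related to the outcome; controlling for it enlarges the estimated treatment effect, just as described above.

```python
import numpy as np

# Suppression demo: z (think propensity score / SFA proxy) raises the chance of
# treatment but lowers the outcome, so the zero-order treatment coefficient
# understates the conditioned (adjusted) coefficient.
rng = np.random.default_rng(1)
n = 2000
z = rng.normal(size=n)
t = z + rng.normal(size=n)                   # z positively related to treatment
y = 1.0 * t - 1.5 * z + rng.normal(size=n)   # z negatively related to outcome

ones = np.ones(n)
zero_order = np.linalg.lstsq(np.column_stack([ones, t]), y, rcond=None)[0][1]
conditioned = np.linalg.lstsq(np.column_stack([ones, t, z]), y, rcond=None)[0][1]
print(round(zero_order, 2), round(conditioned, 2))  # zero-order is the smaller one
```

    With these parameter choices the zero-order slope is about 0.25 while the conditioned slope recovers the true effect of 1.0, so omitting the controls suppresses the effect.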

  16.

    In Figure 12.2, for the models bearing on the fourth grade writing tests, an alternative covariance structure, compound symmetry, is applied rather than the UN(1) used subsequently, because the latter’s value of the generalized χ² divided by the degrees of freedom is always 1.

  17.

    When SFA and the propensity scores are deleted from the full models and the reduced models are re-estimated, the fits of the reading and writing models worsen: the χ²/df values increase to 5.02 and 5.31, respectively. For mathematics the χ²/df increases slightly to 11.05. The χ²/df = 3.22 for the average of reading and mathematics is about unchanged.

  18.

    These different measures illustrate various distinctions of Lazarsfeld and Menzel (1972, 227–229) concerning properties of collectives. The measures on cohorts of students in the schools are analytical properties of the schools; the writing tests are analytical properties of the fourth grade of the schools; and the quality ratings are pseudo-global characteristics of the schools.

  19.

    Jill Tao of SAS clarifies the function of this nesting operator in this application as follows:

    Hi Bob,

    The reason why the results are identical between subject = school and subject = school(treatment) is probably because your schools have unique ids; in other words, schools are uniquely identified by the value of itself. You will see different results between the two specifications when the same set of school ids are used across treatments. For example, for treatment 1, you have schools 1 to 10, for treatment 2, the schools are also labeled 1 to 10, although they are different schools, in which case, you must use school (treatment) to uniquely identify the schools.

    The GROUP = option does totally different things. When you say GROUP = treatment, it instructs Proc Mixed to estimate separate sets of variance-covariance for different treatment. So if you have 3 treatments, you would have 3 sets of the covariance parameter estimates, one for each treatment.

  20.

    If the UN option is used, then only variances and covariances are printed out, and these data appear in a different order, thus necessitating different code for this test.

  21.

    The estimate statements for the overall change and change from Time 1 to Time 2 follow:

    Title 'Difference-in-differences estimator for the third time period (Time 2) relative to the first (Time 0)';
    * for the comparison group;
    estimate '0,2 - 0,0' period -1 0 1 treatment*period -1 0 1;
    * for the target group;
    estimate '1,2 - 1,0' period -1 0 1 treatment*period 0 0 0 -1 0 1;
    * the difference between the two differences above;
    estimate '(1,2 - 1,0) vs (0,2 - 0,0)' treatment*period 1 0 -1 -1 0 1;

    Title 'Difference-in-differences estimator for the third time period (Time 2) relative to the second (Time 1)';
    * for the comparison group;
    estimate '0,2 - 0,1' period 0 -1 1 treatment*period 0 -1 1;
    * for the target group;
    estimate '1,2 - 1,1' period 0 -1 1 treatment*period 0 0 0 0 -1 1;
    * the difference between the two differences above;
    estimate '(1,2 - 1,1) vs (0,2 - 0,1)' treatment*period 0 1 -1 0 -1 1;
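    The logic of the final estimate statement in each block — the DID contrast is the target-group contrast minus the comparison-group contrast — can be checked directly. The snippet below (Python for illustration; the chapter's code is SAS) writes each contrast over all six treatment*period cells and differences them.

```python
# The treatment*period coefficients cover six cells, ordered as
# (treatment 0: Time 0, 1, 2) then (treatment 1: Time 0, 1, 2).
# SAS pads trailing zeros, so '-1 0 1' means [-1, 0, 1, 0, 0, 0].
comparison_20 = [-1, 0, 1, 0, 0, 0]   # '0,2 - 0,0'
target_20     = [0, 0, 0, -1, 0, 1]   # '1,2 - 1,0'
did_20 = [t - c for t, c in zip(target_20, comparison_20)]

comparison_21 = [0, -1, 1, 0, 0, 0]   # '0,2 - 0,1'
target_21     = [0, 0, 0, 0, -1, 1]   # '1,2 - 1,1'
did_21 = [t - c for t, c in zip(target_21, comparison_21)]

print(did_20)  # [1, 0, -1, -1, 0, 1], as in the third estimate statement
print(did_21)  # [0, 1, -1, 0, -1, 1], as in the sixth estimate statement
```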

  22.

    The lsmestimate statements have this form:

    * DID estimates for Time 2 - Time 0;
    lsmestimate treatment*period 'diff in comp group' -1 0 1;
    lsmestimate treatment*period 'diff in trt group' 0 0 0 -1 0 1;
    lsmestimate treatment*period '(1,2 - 1,0) vs (0,2 - 0,0)'
      1 0 -1 -1 0 1 / or ilink cl;

    * DID estimates for Time 2 - Time 1;
    lsmestimate treatment*period 'diff in comp group' 0 -1 1;
    lsmestimate treatment*period 'diff in trt group' 0 0 0 0 -1 1;
    lsmestimate treatment*period '(1,2 - 1,1) vs (0,2 - 0,1)'
      0 1 -1 0 -1 1 / or ilink cl;

  23.

    DID12 can be calculated easily from the proportions of Box 12.6 by taking the difference between the proportions in the treatment group minus the difference between the proportions in the comparison group for those time periods: (0.9683 − 0.9668) − (0.8918 − 0.7528) = 0.0015 − 0.1390 = −0.1375. For the comparison group this difference is (−1)(−0.1375) = 0.1375.
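    The same arithmetic in code, with the proportions copied from Box 12.6 (Python here purely as a calculator):

```python
# Difference-in-differences from the Box 12.6 proportions.
treat_diff = 0.9683 - 0.9668   # change in the treatment group
comp_diff = 0.8918 - 0.7528    # change in the comparison group
did12 = treat_diff - comp_diff
print(round(did12, 4))         # -0.1375; the sign flips for the comparison group
```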

  24.

    For reading tests, the estimated V correlation matrix for the school nested by treatment, based on a CS covariance structure looks like this:

    Row   Column 1   Column 2   Column 3
    1     1.0000     0.5470     0.5470
    2     0.5470     1.0000     0.5470
    3     0.5470     0.5470     1.0000

    All on-diagonal correlations have the same value and all off-diagonal elements have the same value. The R-side covariance parameter estimates are school (treatment) = 4.47 (0, 10.7268), Z = 1.4, Pr Z = 0.1614; the residual = 3.702 (2.1391, 7.9112), Z = 3.08, Pr Z = 0.001.
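    Under compound symmetry the common correlation is the between-school variance divided by the total variance, and the reported covariance parameters do reproduce the 0.5470 in the matrix (a quick Python check, using only the numbers above):

```python
# Compound-symmetry (intraclass) correlation implied by the reported
# covariance parameters: rho = var_school / (var_school + var_residual).
var_school = 4.47
var_residual = 3.702
rho = var_school / (var_school + var_residual)
print(round(rho, 4))  # 0.547, matching the off-diagonal entries of the matrix
```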

  25.

    For mathematics tests, the estimated V correlation matrix for the school nested by treatment, based on a TOEP(2) covariance structure looks like this:

    Row   Column 1   Column 2   Column 3
    1     1.0000     0.6066
    2     0.6066     1.0000     0.6066
    3                0.6066     1.0000

    There are two bands of information; this banded structure has no parameters in row 1, column 3 or in row 3, column 1. The TOEP(2) R-side covariance parameters are both statistically significant: school (treatment) = 7.21 (1.5288, 12.8913), Z = 2.49, Pr Z = 0.0129; the residual = 11.8856 (6.8351, 25.6330), Z = 3.05, Pr Z = 0.0012.

  26.

    On the average of the reading and mathematics tests, the estimated V correlation matrix for the schools nested by treatment, based on a CS covariance structure looks like this:

    Row   Column 1   Column 2   Column 3
    1     1.0000     0.5415     0.5415
    2     0.5415     1.0000     0.5415
    3     0.5415     0.5415     1.0000

    All on-diagonal correlations have the same value and all off-diagonal elements have the same value. The R-side covariance parameter estimates are school (treatment) = 3.836 (0, 9.1123), Z = 1.42, Pr Z = 0.1542; the residual = 3.2486 (1.8618, 7.0528), Z = 3.02, Pr Z = 0.0012.

  27.

    For the fourth grade writing tests, the estimated V correlation matrix for the school nested by treatment, based on a UN(1) covariance structure looks like this:

    Row   Column 1   Column 2   Column 3
    1     1.0000
    2                1.0000
    3                           1.0000

    All on-diagonal correlations have the same value and all off-diagonal elements are null. The R-side covariance parameter estimates are: UN(1, 1) school (treatment) = 2.2318 (0.9381, 10.414), Z = 1.76, Pr Z = 0.0389; UN(2, 2) school (treatment) = 14.2834 (6.9128, 44.9467), Z = 2.2, Pr Z = 0.0138; and UN(3, 3) school (treatment) = 4.5156 (2.1249, 15.2605), Z = 2.1, Pr Z = 0.0177. The generalized χ² = 26; χ²/df = 1.00.

  28.

    The section on "Corrections for Multiplicity" in the previous chapter explicated the step-down Bonferroni. To illustrate the calculations for the step-up false discovery rate using the four obtained p-values of Panel 12.1, first organize them in ascending order from left to right, as in the box below, and then apply these equations (SAS 1997a, 802):

    Number of parameters        R-3 = 1      R-2 = 2      R-1 = 3      R = 4
    Symbols                     p1           p2           p3           p4
    Raw p-values                0.005        0.012        0.038        0.121
    Adjusted ps, FDR            s1 = 0.020   s2 = 0.024   s3 = 0.051   s4 = 0.121
    Adjusted ps, Bonferroni     s1 = 0.020   s2 = 0.036   s3 = 0.076   s4 = 0.121

    s_R = p_R = 0.121;

    s_(R-1) = min(s_R, [R/(R-1)]p_(R-1)) = min(0.121, (4/3)(0.038)) = min(0.121, 0.0507) = 0.051;

    s_(R-2) = min(s_(R-1), [R/(R-2)]p_(R-2)) = min(0.051, (4/2)(0.012)) = min(0.051, 0.024) = 0.024;

    s_(R-3) = min(s_(R-2), [R/(R-3)]p_(R-3)) = min(0.024, (4/1)(0.005)) = min(0.024, 0.020) = 0.020.

    The false discovery correction retains the significance of three of the three significant raw probabilities; the more stringent Bonferroni retains the significance of only two. SAS (1997a, 798–802) discusses the assumptions of these and other tests.
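    Both adjustments can be reproduced in a few lines. The Python functions below are an illustrative re-implementation (not SAS's Proc Multtest) of the step-up false discovery rate and the step-down Bonferroni (often called the Holm adjustment); they recover the values in the box above.

```python
# Multiplicity adjustments applied to the four raw p-values of Panel 12.1.
def step_up_fdr(pvals):
    """Step-up false discovery rate adjusted p-values (pvals ascending)."""
    r = len(pvals)
    adj, running_min = [0.0] * r, 1.0
    for i in range(r - 1, -1, -1):            # work down from p_R to p_1
        running_min = min(running_min, r / (i + 1) * pvals[i])
        adj[i] = running_min
    return adj

def step_down_bonferroni(pvals):
    """Step-down Bonferroni (Holm) adjusted p-values (pvals ascending)."""
    r = len(pvals)
    adj, running_max = [0.0] * r, 0.0
    for i in range(r):                        # work up from p_1 to p_R
        running_max = max(running_max, (r - i) * pvals[i])
        adj[i] = min(1.0, running_max)
    return adj

raw = [0.005, 0.012, 0.038, 0.121]
print([round(p, 3) for p in step_up_fdr(raw)])           # [0.02, 0.024, 0.051, 0.121]
print([round(p, 3) for p in step_down_bonferroni(raw)])  # [0.02, 0.036, 0.076, 0.121]
```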

References

  • Achen, Christopher H. 1986. The statistical analysis of quasi-experiments. Berkeley: University of California Press.

  • Aladjem, Daniel K., Kerstin Carlson LeFloch, Yu Zhang, Anja Kurki, Andrea Boyle, James E. Taylor, Suzannah Herrmann, Kazuaki Uekawa, Kerri Thomsen, and Olatokunbo Fashola. 2006. Models matter—the final report of the national longitudinal evaluation of comprehensive school reform. Washington, DC: American Institutes for Research. September.

  • Borman, Geoffrey D., and Gina M. Hewes. 2002. The long-term effects and cost-effectiveness of Success for All. Educational Evaluation and Policy Analysis 24: 243–266.

  • DerSimonian, Rebecca, and Nan Laird. 1986. Meta-analysis in clinical trials. Controlled Clinical Trials 7: 177–188.

  • Finn, J.D., and C.M. Achilles. 1999. Tennessee’s class size study: Findings, implications, misconceptions. Educational Evaluation and Policy Analysis 21: 97–109.

  • Firth, David. 1993. Bias reduction of maximum likelihood estimates. Biometrika 80: 27–38.

  • Fleiss, Joseph L. 1981. Statistical methods for rates and proportions. New York: Wiley.

  • Glass, Gene, and Mary Smith. 1978. Meta-analysis of research on the relationship of class size and achievement. San Francisco, CA: Far West Laboratory for Educational Research and Development.

  • Glass, Gene, and Mary Smith. 1979. Relationship of class size to classroom processes, teacher satisfaction, and pupil affect: A meta-analysis. San Francisco, CA: Far West Laboratory for Educational Research and Development. July.

  • Hoxby, Caroline M. 2000. The effect of class size on student achievement: New evidence from population variation. Quarterly Journal of Economics 115(4): 1239–1285.

  • Imbens, Guido W. 2009. Better LATE than nothing: Some comments on Deaton (2009) and Heckman and Urzua (2009). Department of Economics, Harvard University, 1–30.

  • Imbens, Guido W., and Jeffrey M. Wooldridge. 2009. Recent developments in the econometrics of program evaluation. Journal of Economic Literature 47(1): 5–86.

  • Jepsen, Christopher, and Steven Rivkin. 2009. Class size reduction and student achievement: A potential tradeoff between teacher quality and class size. Journal of Human Resources 44(1): 223–250.

  • Lazarsfeld, Paul F., and Herbert Menzel. [1961] 1972. On the relation between individual and collective properties. In Continuities in the language of social research, ed. Paul F. Lazarsfeld, Ann K. Pasanella, and Morris Rosenberg, 219–237. New York: The Free Press.

  • Littell, Ramon C., George A. Milliken, Walter W. Stroup, Russell D. Wolfinger, and Oliver Schabenberger. 2006. SAS for mixed models, 2nd ed. Cary, NC: SAS Institute.

  • Mosteller, Frederick. 1995. The Tennessee study of class size in the early grades. Critical Issues for Children and Youth 5(Summer/Fall): 113–127.

  • Murphy, Dan, and Bella Rosenberg. 1998. Recent research shows major benefit of small class size. Educational Issues Policy Brief 31. Washington, DC: American Federation of Teachers, 1–3.

  • Peabody, Zanto. 2003. Sharpstown numbers game shows flaw in school reform. Houston Chronicle, July 5.

  • Rosenbaum, Paul R., and Donald B. Rubin. 1985. Constructing a control group using multivariate matched sampling methods that incorporate the propensity score. The American Statistician 39: 33–38.

  • Rosenthal, Robert. 1991. Meta-analytic procedures for social research, rev. ed. Applied Social Research Methods Series, vol. 6. Newbury Park, CA: Sage.

  • SAS Institute. 2005. The Glimmix procedure. Cary, NC: SAS Institute. November.

  • Schemo, Diana J. 2003. Questions on data cloud luster of Houston schools. New York Times, July 11.

  • Seffrin, John R. 2009. Letter to the editor: Re “In Long Drive to Cure Cancer, Advances Have Been Elusive” (“Forty Years’ War” series, front page, April 24). New York Times, April 27.

  • Singer, Judith D., and John B. Willett. 2003. Applied longitudinal data analysis. New York: Oxford University Press.

  • Wood, Elizabeth, et al. 1990. Project STAR final executive summary report: Kindergarten through third grade (1985–1989). Nashville, TN: Tennessee State Department of Education.

  • Wooldridge, Jeffrey M. 2002. Econometric analysis of cross section and panel data. Cambridge, MA: MIT Press.

  • Wooldridge, Jeffrey M. 2006. Introductory econometrics: A modern approach, 3rd ed. Mason, OH: Thomson/South-Western.


Acknowledgments

The author produced the initial evaluation report (Smith, September 19, 2003c) as a “third-party evaluator.” Dr. Bruce Goldberg of Co-nect was the contracting officer and monitored the project; Sarah Cook of Co-nect provided research assistance; and Dr. Henry May of the University of Pennsylvania critiqued the evaluation report. That report was included in a meta-analysis conducted by Aladjem et al. (September 2006). Because the meta-analysts focused only on the overall effects of the reforms and not on the dynamics of the change, they underreported the effects of the Co-nect reforms, which this chapter explicates. Vivek Pradhan clarified aspects of separation in logistic regression models and pointed to Firth’s solution to this problem. Philip Gibbs and Jill Tao of SAS answered my numerous questions about Proc Glimmix, covariance structures, and the programming of the difference-in-differences estimators. However, the author alone is responsible for the views expressed here, and not Co-nect (which is now a subsidiary of Pearson) or any other organization or person.

Author information

Correspondence to Robert B. Smith.


Copyright information

© 2011 Springer Science+Business Media B.V.

Cite this chapter

Smith, R.B. (2011). Using Propensity Scores. In: Multilevel Modeling of Social Problems. Springer, Dordrecht. https://doi.org/10.1007/978-90-481-9855-9_12

