September 8, 2014

Teacher Evaluation Reform: Building Research to Inform Policy

By: Matthew P. Steinberg

Student achievement is influenced by many factors outside of a school’s control, including students’ baseline knowledge and skills when they enter the system and their families’ income level and overall economic stability. However, of the factors that are within a school’s sphere of influence, teachers are the most important for improving student outcomes. A teacher’s influence is particularly crucial in large urban districts, where student achievement tends to be low and neighborhood disadvantage high.

Given the critical role teachers play in the educational lives of their students, school districts need to be able to identify which teachers are the most capable and effective. Historically, however, districts have done a poor job of identifying effective teachers. For example, a widely cited study by the New Teacher Project found that more than 99% of teachers were rated satisfactory in districts that use binary evaluation ratings, that is, ratings that offer “satisfactory” or “unsatisfactory” as the mutually exclusive choices for judging characteristics of teacher performance. Traditional teacher evaluation systems have failed to meet two essential objectives of personnel evaluation: improving teachers’ instructional practice and identifying and removing low-performing teachers from the classroom.

Personnel management systems in public school districts should be designed to identify high-performing teachers so that they can be retained while remediating and, if necessary, removing low-performing teachers from the teaching force. Recognizing the need to improve our evaluation systems, the Obama administration, through its Race to the Top (RTTT) initiative, funded states and districts to invest in and develop evidence-based, incentive-laced systems to reform teacher evaluation. These systems incorporate multiple measures of teacher performance, including classroom observation rubrics that define instructional improvement on a continuum (rather than by a binary checklist) and, most notably, measures of student performance on state standardized tests (so-called value-added measures, or VAMs).

I have recently conducted a series of studies that examine teacher evaluation reform efforts. These studies pay particular attention to the ways teacher evaluation reform plays out in public schools in urban settings, where the juggernaut of reform has taken hold earlier, and been implemented far more vigorously, than elsewhere around the nation.

In an ongoing study in Chicago Public Schools (CPS), Lauren Sartain from the University of Chicago Consortium on Chicago School Research and I examine the extent to which teacher evaluation reform satisfies the two objectives of personnel evaluation. In one paper, we find that well-developed evaluation systems that use multiple observations of teachers with ongoing formative feedback can improve teacher practice and student performance. In a second paper, we show that, when more detailed information about teacher performance is provided to principals, principals are better able to identify low-performing teachers, and these low-performing teachers are more likely to exit the district. These findings suggest that intensive, structured classroom observation of teacher practice provides additional information to both teachers and principals about teacher performance.

While policy efforts, namely the RTTT initiative and No Child Left Behind (NCLB) waivers, have encouraged the incorporation of multiple measures of performance in newly developing teacher evaluation systems, practical constraints often result in evaluations based predominantly on formal classroom observations. In particular, more than two-thirds of all teachers nationwide teach in grades or subjects that aren’t tested by state-mandated achievement exams, and therefore these teachers will not have VAM scores as part of their performance evaluation. Yet there is limited knowledge of whether observational measures alone are able to identify effective teachers.

In a recent paper using data from the Gates Foundation-funded Measures of Effective Teaching (MET) Study, Rachel Garrett, a researcher at the American Institutes for Research, and I examined the extent to which scores from classroom observations (in a context without formative feedback on teachers’ instructional practice) reliably identify effective teachers. We found that teachers’ observation scores varied so much from year to year that they are a relatively weak tool for distinguishing teacher effectiveness. Our findings suggest that policymakers should be cautious about making high-stakes personnel decisions based solely on observation scores.

In a forthcoming overview of the evolution of educational accountability into so-called “new educational accountability,” which places accountability for student performance squarely on teachers, University of Connecticut’s Morgaen Donaldson and I shed light on a number of important differences between urban and nonurban districts in the design and implementation of teacher evaluation reforms. First, we found that principals and assistant principals in urban districts have a heavier observation burden than their counterparts in other districts. Specifically, urban districts require their administrators to perform more formal and informal observations, and they place more emphasis on post-observation conferences. While these requirements are more likely to produce the kind of robust observations that Lauren Sartain and I found to be effective in Chicago, the demands on administrators may be particularly challenging in urban settings, which often have fewer resources than nonurban districts.

Further, we found that teachers of tested and untested grades and subjects are evaluated quite differently. For example, up to half of a math teacher’s evaluation may be based on the standardized test scores of his or her students. However, for a teacher at the same school whose students might not be required to take standardized tests (such as a social studies teacher), the average achievement of the school’s students in tested subjects such as math and reading will play an important role in this teacher’s performance evaluation. This is the case in all districts we studied, but more so in large urban districts, where schoolwide achievement measures play a larger role in the evaluation scores of teachers in subjects such as social studies. In such cases, factors largely out of these teachers’ individual control and disconnected from their everyday work will constitute a greater proportion of their ratings. In schools serving more disadvantaged students, such as those in most urban districts, the heavy reliance on lower schoolwide scores will fail to acknowledge any of the contributions that these teachers make to improving student outcomes. This has implications for how successful urban districts will be in retaining their most valuable teachers.

When it comes to improving student achievement, particularly in large urban areas where the need for gains is so pressing, there is a new urgency about improving the overall quality of the teaching force. While early evidence shows the potential of well-designed, multi-faceted evaluation systems, other evidence suggests that we should avoid burdening teachers with measures that fall short in distinguishing teacher effectiveness. Only through ongoing and rigorous study will we be able to refine these new systems and show policymakers how evaluation reforms can enhance the overall quality of teaching and learning in our nation’s schools.

Matthew P. Steinberg is a Penn IUR Faculty Fellow and Assistant Professor of Education in the Education Policy and Teaching, Learning, and Leadership Divisions of Penn's Graduate School of Education. 

