The Need for Program Evaluation to Support Accountability Implementation

Jun 28, 2018

Even the Best Designs Still Have a Burden of Proof

Accountability systems are supposed to incentivize behavior that promotes equity in educational opportunity and leads to positive student outcomes. But how do we really know? Even the best designs still have a burden of proof. Program evaluation principles, applied to the schools that accountability systems identify, are powerful tools for examining accountability’s impact, usefulness, and relevance. Program evaluation facilitates the collection, use, and interpretation of the right information to improve or understand a system and its impact. 

Does this “accountability information” actually do more than help real estate aggregators rank-order school districts for house hunters? Enter a common research holy grail: the claim that a program has had a beneficial and lasting impact on the problem. Making that claim requires that we know about the context, the people, and their on-the-ground needs. Let’s consider an example that many have experienced: car trouble. 

Selecting the right intervention is a lot like diagnosing a car’s problem. Let’s say you notice a strange grinding sound coming from your car as you approach a stoplight. Immediately, you start trying to pinpoint the source of that noise. Then you might add variables that help you better describe it. But how many variables are too many? Is the sound there only when you come to a stop? Is it always there? Is it getting worse over time? Does it happen only when you turn the wheel to the right or left? 

Depending on your mechanical prowess (or desire for it), you may simply want to be able to describe it to a subject matter expert (e.g., mechanic) so you can save them some time (and you some money). If you’re the one attempting to make the repair, then testing and diagnostic approaches might be more thorough. After all, you want to be sure you buy the right parts. Either way, you need sufficient information to feel confident in the diagnosis. Correct diagnoses should result in relevant remedies. If the diagnosis is accurate, then you can confirm that the remedy worked. If the remedy did NOT fix the problem, then either your diagnosis was wrong, you picked the wrong remedy, or both. 

Educational interventions can be similar with regard to pinpointing and targeting the source of a problem. Let’s say we have identified a need due to chronic underperformance (e.g., Comprehensive Support and Improvement schools under ESSA). We want to use the latest evidence-based strategy that seems to solve this problem. Now we implement, monitor, and wait, right? Wrong. We need to really understand and diagnose what the real problem is, connect those needs with relevant strategies or behaviors, and then dig into the details. There are many templates for collecting the information (e.g., comprehensive needs assessments), but few for how to interpret it.

We start with the usual suspect questions: Did the strategy work? If it worked, why? Would it have worked if we had done something differently? Can we replicate this in other settings? If it didn’t work, was our diagnosis wrong? Maybe our diagnosis was right, but we forgot to really understand the context and changed too much of the strategy.

Program evaluation methods are at the heart of how these questions interact with the problem. Good evaluation leverages the right methods to understand design, delivery, implementation, and results. Evaluating our decisions helps us determine whether we’re seeing a change in the signals because of our program or strategy, or whether apparent improvements are just noise.

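One common way to put the signal-versus-noise question to a quick test is a permutation check: shuffle the “program vs. comparison” labels and see how often chance alone produces a gap as large as the one observed. The sketch below is purely illustrative; the score changes and group sizes are invented, not drawn from any real accountability data.

```python
# Toy permutation check of "signal vs. noise": shuffle program labels
# many times and count how often a chance split produces a gap in mean
# score change at least as large as the observed one.
# All numbers are hypothetical, for illustration only.
import random

random.seed(0)

program = [3.1, 2.4, 4.0, 1.8]      # year-over-year score changes, program schools
comparison = [0.5, -0.2, 1.1, 0.9]  # score changes, comparison schools

def mean(xs):
    return sum(xs) / len(xs)

observed = mean(program) - mean(comparison)
pooled = program + comparison

extreme = 0
trials = 10_000
for _ in range(trials):
    random.shuffle(pooled)
    gap = mean(pooled[:4]) - mean(pooled[4:])
    if gap >= observed:
        extreme += 1

p_value = extreme / trials
print(f"observed gap: {observed:.2f}, approximate p-value: {p_value:.3f}")
```

A small p-value here only says the gap is unlikely under random labeling; it says nothing about *why* the gap exists, which is where the diagnostic work above comes in.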
There appears to be snowballing interest in evaluating accountability and improvement system design at the state level, and rightly so. We have been applying top-down accountability pressure for a long time. The questions are shifting to how accountability supports improvement and how accountability system design can be leveraged to promote positive behavioral change beyond shame avoidance. I would argue the following logic chain: 

  1. Accountability identification is an efficient way of grouping schools to triage support.
  2. Number 1 only holds if the accountability system is identifying the “right” schools.
  3. Accountability data are proxies of school success: student readiness for what comes next.
  4. Student readiness beyond K-12 is a function of school and community behaviors and capacities that are hard to measure efficiently.
  5. It is difficult to make sweeping claims about school behavioral and capacity characteristics using only accountability data.
  6. Defining the right schools in number 2 means ratings reflect both accountability data characteristics and unmeasured behavioral and capacity profiles.
  7. Confirming the coherence between accountability (i.e., proxy) data and improved teaching and learning behaviors (behavior and capacity in #6) requires targeted uses of program evaluation.
  8. Evidence for number 7 can initially be based on whether accountability outcomes improve among schools that implement relevant and appropriate strategies.

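Point 8 suggests a simple first-pass check: compare average outcome change between identified schools that implemented a relevant strategy and those that did not. The sketch below uses invented records and field names; a raw gap like this is only suggestive evidence, since it ignores selection into implementation, measurement error, and regression to the mean.

```python
# Hypothetical first pass at point 8: difference in average outcome
# gains between identified schools that implemented a relevant strategy
# and those that did not. All records are invented for illustration.

def mean(xs):
    return sum(xs) / len(xs)

# Each record: (school, implemented_strategy, score_before, score_after)
schools = [
    ("A", True, 42.0, 48.0),
    ("B", True, 39.0, 44.0),
    ("C", False, 41.0, 42.0),
    ("D", False, 40.0, 39.5),
]

gains_impl = [after - before for _, impl, before, after in schools if impl]
gains_other = [after - before for _, impl, before, after in schools if not impl]

gap = mean(gains_impl) - mean(gains_other)
print(f"avg gain (implemented): {mean(gains_impl):.2f}")
print(f"avg gain (comparison):  {mean(gains_other):.2f}")
print(f"difference in gains:    {gap:.2f}")
```

Even this crude comparison forces the evaluator to name the outcome, the implementation status, and the time window, which is most of the inferential work in the chain above.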
Obviously, this chain includes a lot of inferential connections that are tough to test through large scale state tests. However, ESSA has given us an opportunity to test how well accountability design connects with improvement systems, especially in the case of CSI schools. 

States are on the hook to support CSI schools more intensely in order to accelerate gains in teaching and learning. How we deconstruct the links across identification, support, implementation, and outcome gains can be powerful evidence that connects the dots between identification and improvement. With a strong argument, those connected dots can serve as a template for other schools or districts looking for ideas, guidance, or even examples of how strategies help make gains. 

In the spirit of continuous improvement, I recognize that I’m simply laying out the nature of the problem and characterizing a general solution. My next blog will outline a more detailed solution that I hope can both reduce noise in confirming accountability design and present strategies to examine improvement over time.