Peer Review: Can Emerging Programs Meet the Requirements?

Challenges for Through-Year and Other Innovative Test Designs

If you had asked me last year whether the U.S. Department of Education’s (USED) assessment peer review would work for emerging assessment programs like through-year, my answer would have been no. Turns out I was wrong.

I thought that, at a minimum, the peer-review process would have to be updated with additional critical elements like those introduced for locally selected, nationally recognized high school assessments. But after I helped organize and lead the peer-review focus area of USED’s recent state assessment conference, my answer is now yes.

In fact, with careful nuance, a submission for these kinds of programs can meet the current peer-review criteria, under any of the through-year program designs currently under consideration (see Dadey & Gong, 2023, for a deep dive, and Dadey et al., 2023, for a higher-level summary of key considerations for through-year assessment).

I’m emphasizing careful nuance here, because like everything else in peer review, the devil is in the details. Although my answer to the guiding question is now yes, it’s really a yes, but… and the buts require careful unpacking and consideration. 

In this blog, I’ll provide context about peer review, then turn to some conceptually simple—but perhaps logistically challenging—recommendations for improving the peer-review process that help address some of these buts.

Making the Case for a Through-Year Program in Peer Review

Crafting a peer-review submission involves helping peers understand an assessment program and connecting that understanding to the quality criteria—referred to as “critical elements”—in the peer review guidance. Making these connections can be quite direct for a typical end-of-year summative assessment program, but it requires much more substantial work when the assessment program is atypical. 

For example, if an assessment program uses the results of multiple administrations to create annual determinations, the submission will need to articulate why scores are being created this way, how the scores are being created, what inferences those scores support, and, importantly, how potentially novel evidence meets each critical element. In doing so, a submission may need to redefine how evidence should be understood (such as providing an alternate definition or procedure for measurement precision). 

States that submit these kinds of programs will have to put in some careful and nuanced work to determine how to approach key critical elements, collect the right kinds of evidence, and present that evidence in a way that builds and supports the peer reviewers’ understanding. Not every critical element will require novel conceptualizations and evidence, but some will. 

In the case of programs that create summative determinations based on multiple assessment administrations, critical elements that will likely require substantial, novel work include: 

  • Test administration (2.3)
  • Overall validity (3.1)
  • Reliability (4.1)
  • Scoring (4.4)
  • Inclusion of students with disabilities (5.1 and 5.2)
  • Reporting (6.4)

Other elements will also require novel work, like test monitoring (2.4), but much of that work would be covered by the elements listed above. 

It’s worth noting that peer review applies only to the components of the program that are used to produce summative scores. Some states call their programs through-year but only use results from a single administration toward the end of the year. In those cases, the peer review would not differ substantially from that of more-typical state programs. 

Example: Using Critical Element 2.3, Test Administration

What might this substantial, novel work look like? Consider critical element 2.3, test administration, as an example. This element involves ensuring that a state develops and implements a consistent procedure for standardized test administration. 

Suppose we have a through-year program composed of a set of assessments, each of which assesses a small part of the content domain, such as a standard. Suppose further that these assessments are administered based on educator judgment, once an educator decides that a student is ready. This hypothetical program is among the most extreme designs for a through-year program; most programs are not this extreme. 

Can such a program meet critical element 2.3?

I argue that it can. To do so, a state would have to describe the administration process in detail, including the allowable variations within that process. Doing so allows a state to move away from the “every student gets a parallel form of the same test within a given assessment window” style of administration. Instead, a state would need to ensure that:

  • It has defined the process for identifying when students have had sufficient instruction to be assessed, and has documented that process and communicated it to educators
  • Educators have received training on this identification process
  • Educators implement this process with fidelity
  • It monitors the administration for patterns that suggest educators are not implementing the process with fidelity (e.g., that no tests were given during a semester) or that administration patterns vary inappropriately across student groups 

If these kinds of steps are executed well, the state would have a coherent argument about how its administration process ensures a consistent administration that supports its intended score interpretation.
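To make the monitoring step above concrete, here is a minimal sketch of the kind of check a state might run against its administration logs. The data structure, names, and thresholds are hypothetical, invented purely for illustration; a real monitoring system would draw on the state's actual administration records and its own definitions of expected activity.

```python
from collections import defaultdict

# Hypothetical administration log: one (educator_id, semester, student_group)
# record per test administration. Structure is illustrative only.
log = [
    ("T1", "fall", "group_a"), ("T1", "fall", "group_b"),
    ("T1", "spring", "group_a"),
    ("T2", "spring", "group_a"),  # educator T2 gave no tests in the fall
]

ALL_EDUCATORS = {"T1", "T2"}
SEMESTERS = {"fall", "spring"}

def flag_inactive(log, educators=ALL_EDUCATORS, semesters=SEMESTERS):
    """Flag (educator, semester) pairs with zero administrations --
    the 'no tests were given during a semester' pattern."""
    seen = {(educator, semester) for educator, semester, _ in log}
    return sorted((e, s) for e in educators for s in semesters
                  if (e, s) not in seen)

def group_counts(log):
    """Tally administrations by student group, so reviewers can look
    for inappropriate variation across groups."""
    counts = defaultdict(int)
    for _, _, group in log:
        counts[group] += 1
    return dict(counts)

print(flag_inactive(log))  # T2 administered nothing in the fall
print(group_counts(log))   # counts to inspect for group disparities
```

Checks like these would feed the fidelity-monitoring evidence described above; flagged patterns would then be investigated rather than treated as automatic violations.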

Changes Needed in the Peer-Review Process

Until this point, I’ve painted a fairly rosy picture of what is possible. But what is possible in theory and what actually plays out in practice can be quite different. Getting to the “substantially meets” designation can be a slog, often requiring numerous resubmissions to meet what can feel like constantly shifting goalposts. 

I recommend at least three improvements to the peer-review process. USED should:

  1. Add unique evidence about through-year to the current peer-review guidance, in the same way it’s added text for CAT assessments and Alternate Assessments of Alternate Achievement Standards (AA-AAS). This change would provide much-needed details to reviewers about how the element should be understood in light of the unique aspects of specific applications of assessment. This added text is needed only for the elements that pose the most unique or substantial challenges. 
  2. Increase opportunities for consensus-making and training to address novel programs. The peer-review process involves small teams working individually on a few states’ submissions to produce a set of completed peer-reviewer notes. This approach provides virtually no opportunity for consensus-making across the reviewer teams. Building in consensus-making could resolve a number of issues often reported about peer review, such as the variability in reviewer judgments across states despite very similar evidence. 
  3. Recruit reviewers who have deep expertise in novel programs and ensure that they are included on panels reviewing these kinds of programs. 

These changes are especially important in light of the complications of through-year assessment programs. These kinds of processes do come with a cost, but if they save on repeated resubmissions of a program, they may well be worth the price. 

The Road Ahead

I’ve focused on assessment programs that use results from multiple administrations to create summative scores. But these programs are actually fairly rare, since most—even those that go by the name through-year—have opted to use only end-of-year results. 

The peer review for these programs, such as those in Alaska, Nebraska, Maine, and Virginia, will likely be quite similar to those of traditional end-of-year programs. Programs that are considering or planning to use multiple administrations, like North Carolina’s or Montana’s, likely face a much longer timeline. When—and if—they do undergo peer review, it will be highly instructive for all future programs.