Re-Envisioning Performance Standards Validation

Taking a Principled Approach to Evaluating Reporting Scales and Performance Standards when Modifications are Made to Assessments

For a variety of reasons—political, psychometric, and practical—states are often required to modify their large-scale summative assessments. These changes may be significant, such as developing new assessments after the adoption of revised academic content standards, or minor, such as adding a couple of items to an existing test blueprint.

When significant changes are made to the content or design of an assessment, best practice dictates that a new score scale and new performance standards be established. These new standards signal that the new test is measuring something fundamentally different and mitigate the likelihood that results from the new and old assessments will be used and interpreted interchangeably.

The Challenges of Minor Test Modifications

When test modifications are minor, the required course of action is often less clear. For example, if an assessment is proportionally reduced across its blueprint by ten items, many would argue that the construct has not changed, so the existing scale and performance standards can be maintained. But what if an entire reportable category is removed, or the manner in which a content standard is assessed is altered to reduce testing time? Can the same argument be made? In these situations, bringing stakeholders together to review and validate the existing performance standards has become common practice. These stakeholder meetings, usually a modified version of a standard setting, are commonly referred to as standards validation.

The Role of Performance Level Descriptors 

Often, performance level descriptors (PLDs) serve as a litmus test for evaluating whether a new scale or standards validation is necessary. If a test is modified in such a way that the existing PLDs can no longer be supported, it is typically argued that, at the very least, educators should engage in standards validation to review and potentially modify the cut scores and/or the PLDs. Unfortunately, this criterion does not capture the full range of test-based modifications that may jeopardize the appropriateness of an existing score scale or set of standards. Shifting from paper to online testing, implementing a non-proportional reduction in test length, and changing item formats, for example, may influence the construct being measured, even if the PLDs do not change. In addition, clear criteria do not exist to inform when these reviews should be conducted, or how to evaluate the effectiveness of the process in producing more appropriate standards.

More importantly, the decision to conduct standards validation assumes the new and revised assessments demonstrate an appropriate degree of construct equivalence to maintain the existing score scale. This assumption is often poorly tested because the type and amount of evidence necessary to support it is not clear and can vary depending on score use, or because of practical constraints such as time and money. Even worse, sometimes the assumption is simply ignored under the inaccurate belief that the modification is so minor that standards validation will provide for any adjustments necessary to support continued use of the scale. 

Maintaining Reporting Scales and Performance Standards Amid Assessment Modifications

In this and future blog posts, we contend that standards validation must be re-envisioned as a broader process that involves identifying and reviewing a body of evidence that supports continued use of the scale and/or performance standards as intended. Depending on the nature of the modification and the intended use of the score, a formal stakeholder meeting may or may not be part of this process.

To ensure the appropriate evidence is collected, a state must articulate how it intends to use the assessment results and the claim(s) necessary to support that use. For example, if the goal is to maintain the existing scale and performance standards so new and old results can be interpreted and used seamlessly, then the primary claim is score comparability; specifically, that the results from the new and old assessment are interchangeable. This is the same claim we address when we engage in equating across test forms and/or years; for equating, however, construct equivalence is typically not an issue because test forms are built to the same specifications.

In most cases, three categories of evidence will be necessary to support a claim of score comparability: descriptive, empirical and judgmental. Descriptive evidence consists primarily of procedural documentation explaining how and why a claim of comparability can be supported.  Empirical evidence includes the results of studies and analyses implemented to demonstrate construct equivalence.  

Judgmental evidence includes feedback from content and technical experts that the modification has not significantly influenced what is being measured. A few examples of each category of evidence are provided below: 

Descriptive Evidence

Documentation that:
- the assessment was modified in a thoughtful and deliberate manner to support maintenance of the existing scale and performance standards.
- the conditions under which the assessment is administered have not changed.
- the materials and resources necessary to administer and score the assessment and to interpret and use the results have not changed.

Empirical Evidence

Analyses which demonstrate:
- strong correlations between results on the original and modified assessment for all students and for particular groups of students.
- consistent performance classifications between the original and modified assessment.
- that the internal structure of the assessment has not changed.
- strong measures of item and person fit.

Judgmental Evidence

Expert feedback to confirm that:
- the existing PLDs are appropriate for the modified assessment.
- performance at the cut scores on the modified assessment still reflects the PLDs.
- the provided empirical evidence supports continued use of the scale.
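To make the empirical evidence category concrete, here is a minimal sketch of two of the analyses listed above: the correlation between results on the original and modified forms, and classification consistency at the performance-level cut scores. The data, cut scores, and variable names are entirely illustrative assumptions, not taken from any actual assessment.

```python
import numpy as np

# Hypothetical scale scores for the same students on the original and
# modified forms (data, scale, and cut scores are illustrative only).
rng = np.random.default_rng(0)
original = rng.normal(500, 50, size=200)
modified = original + rng.normal(0, 15, size=200)  # related but noisy scores

# Empirical evidence: correlation between the two sets of results.
r = np.corrcoef(original, modified)[0, 1]

# Empirical evidence: classification consistency at illustrative cut
# scores separating four performance levels.
cuts = [450, 500, 550]
lvl_orig = np.digitize(original, cuts)
lvl_mod = np.digitize(modified, cuts)
agreement = np.mean(lvl_orig == lvl_mod)  # exact agreement rate

# Cohen's kappa: agreement corrected for chance agreement.
n_levels = len(cuts) + 1
p_chance = sum(
    np.mean(lvl_orig == k) * np.mean(lvl_mod == k) for k in range(n_levels)
)
kappa = (agreement - p_chance) / (1 - p_chance)

print(f"correlation: {r:.2f}")
print(f"exact classification agreement: {agreement:.2f}")
print(f"Cohen's kappa: {kappa:.2f}")
```

In practice these analyses would be run on matched operational data (or a carefully designed comparability study), disaggregated by student group, and interpreted alongside the descriptive and judgmental evidence rather than in isolation.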

The amount and type of evidence needed to support the claim of comparability and justify continued use of the reporting scale and performance standards will vary based on both the type of modifications made to the assessment and the intended uses of the results. For example, in the relatively simple case of shortening a test while proportionally maintaining the test blueprint, there may be little need for judgmental evidence if descriptive evidence is strong. 

In other cases, such as dropping a performance task, eliminating item types, or converting from paper-and-pencil to computer-based testing, a substantial amount of evidence may be necessary to support the desired claim. Regardless of the nature of the modification, if evidence of comparability is sufficient, the existing cut scores can be considered “validated” because the meaning of the scale and performance standards has not changed.

In this post, I have briefly outlined a process for standards validation when a test has been modified and a state wants to maintain the same reporting scale and performance level cut scores. In some cases, however, a state may wish to maintain one or more existing performance standards even after a new test and score scale have been introduced. For example, a state may wish to compare the percentage of students achieving the state-defined college and career readiness (CCR) benchmark on a new and old assessment. In this case, the claim to be supported is that the procedures and evidence used to establish the CCR performance standard are similar enough across the two assessments to allow for this benchmark to be interpreted in a comparable manner. The standards validation process could include having educators establish the CCR standard on the new assessment and evaluate the degree to which the operational definition of CCR is consistent across the two tests.

In an upcoming post, my colleague Leslie Keng will discuss a formal process and methodology for identifying, collecting and evaluating evidence of comparability that support standards validation in light of these different types of use case scenarios. Stay tuned!