Using Reference Data in Standard-Setting

Considerations for a key step in test revision

Standard-setting—the process of establishing cut scores for each performance level on a test—is an important part of any state assessment program. States must consider setting standards when they launch a new test, revise their content standards, or make other changes that could influence assessment comparability. With each new standard-setting effort comes a question: Should we use performance data from other assessments to inform the process?

It often feels intuitive to include reference data—ACT/SAT benchmarks, NAEP results, or prior state assessment data—in a standard-setting process. But doing so without a clear purpose and strategy can confuse panelists and threaten our ability to draw accurate conclusions about what students know.

It’s not enough to say, “Here’s some more information for you to consider when making your rating.” We must be deliberate about what data are provided and the role they should play.

States and their assessment partners must address three fundamental questions when they’re considering introducing reference data:

1. What information, if any, should be provided?

2. What impact should it have on panelists’ recommendations?

3. How can it be incorporated into the process?

What Information Should Be Provided?

The goal of setting standards is to ensure that a state’s policy goals are appropriately translated to the reportable score scale. Consequently, any feedback provided to inform the process, such as benchmarks, item difficulties or panelist ratings, should be in service of this goal.

Policy descriptors vary in how well they define a state’s priorities, but most include some combination of information about the target of measurement, how performance should be interpreted, and implications for the future, as shown in the chart below.

Performance level	Descriptor (target of measurement, interpretation, implications for future)
NAEP Proficient	This level represents solid academic performance for each NAEP assessment. Students reaching this level have demonstrated competency over challenging subject matter, including subject matter knowledge, application of such knowledge to real-world situations and analytical skills appropriate to the subject matter.
Arkansas – Proficient	Students demonstrate a proficient understanding of knowledge and skills and show mastery of grade-level standards. These students are on-track for college and careers and demonstrate readiness for content at the next grade/course.
Texas – Meets Grade Level	Performance in this category indicates that students have a high likelihood of success in the next grade or course but may still need some short-term, targeted academic intervention. Students in this category generally demonstrate the ability to think critically and apply the assessed knowledge and skills in familiar contexts.

Because of how they are structured, policy descriptors can help inform decisions about what reference data to include—and why. A state like Arkansas, which defines “Proficient” as being “on track for college and careers,” might consider incorporating ACT or SAT benchmarks into its standard-setting process to monitor alignment with that definition.

Similarly, if a state is setting a performance standard to identify students at risk for reading difficulties in early grades, it may find it useful to reference data or benchmarks from other assessments designed for that purpose. NAEP is also commonly used because it is considered the gold standard and an objective measure of student performance in grades 4 and 8. Differences in NAEP and state summative proficiency rates are often referred to as the “honesty gap” (even though these results are not strictly comparable and may differ for valid reasons).

Contextual factors can also influence decisions about the additional data a state might present during standard-setting. For example, if the content assessed by the old and new assessments is similar, providing historical impact data or the “old” proficiency cut may help panelists evaluate and refine their recommendations.

By the same token, if the state agency knows its constituents will find it unacceptable to have proficiency rates that are far lower than those observed on NAEP or the previous state summative assessment, information about these tests should be incorporated early in the process.

In most cases, decisions about whether reference data should be included are not psychometric. There is no technical reason to include any of this information in the standard-setting process if the only goal is to support criterion-referenced interpretations of student performance. These are policy decisions that should be made with an understanding of the desired impact and the potential risk.

What Impact Should It Have on Panelists’ Recommendations?

The considerations I’ve outlined above should be addressed in advance, so states can design a process in which the reference data have the intended influence. Factors that shape the impact of reference data include:

Timing: Data introduced early may shape panelists’ initial thinking, while data shared at the end may be used cautiously or not at all (e.g., if panelists find it difficult to give up previously held positions).
Framing: The data will likely have a greater impact if the facilitator highlights their alignment to the state’s policy priorities, rather than framing them as just another piece of information.
Presentation: The way reference data are shared can greatly influence their impact. For example, presenting NAEP impact data side by side with impact from the panel’s recommended cut scores may cause more dissonance than using embedded benchmarks or predicted threshold regions to inform the rating process.

Careful attention to these design factors can mitigate panelists’ discomfort when reference data contradict their recommendations.
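
For teams preparing the side-by-side comparison described under Presentation, impact data are simply the percentage of students who would fall in each performance level under a given set of cut scores. The Python sketch below is a minimal illustration, not a prescribed procedure. The function name `impact_data`, the level labels, the scores, and the cut scores are all hypothetical, and it assumes that a score equal to a cut falls in the higher level.

```python
from bisect import bisect_right


def impact_data(scores, cuts, labels):
    """Return the percent of students in each performance level.

    scores: scale scores for a group of students (hypothetical data)
    cuts:   cut scores separating adjacent levels
    labels: level names, lowest to highest (one more than the cuts)
    """
    if not scores:
        raise ValueError("scores must not be empty")
    if len(labels) != len(cuts) + 1:
        raise ValueError("need exactly one more label than cut scores")
    ordered_cuts = sorted(cuts)
    counts = [0] * len(labels)
    for score in scores:
        # Assumption: a score equal to a cut is placed in the higher level.
        counts[bisect_right(ordered_cuts, score)] += 1
    return {label: 100.0 * count / len(scores)
            for label, count in zip(labels, counts)}


# Hypothetical scores and panel-recommended cuts, for illustration only.
levels = ["Level 1", "Level 2", "Level 3", "Level 4"]
panel_impact = impact_data([290, 300, 349, 350, 410], [300, 350, 400], levels)
print(panel_impact)
```

Running the same function on the reference assessment's scores and cuts would give the second column of the comparison. Whether to show the two side by side remains the design decision discussed above.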

How Can Reference Data Be Incorporated Into the Process?

The table below provides examples of common strategies for embedding reference assessment data into standard-setting:

Strategy	Example
Present the location of a benchmark or standard from the reference assessment on the scale of the new assessment	States can convert ACT/SAT benchmarks into estimated scale scores on the state test to help evaluate whether cut scores align with college readiness claims.
Present impact data from the last administration of the previous state assessment	States can present data from a previous administration to inform discussions about whether and how the impact associated with the new test aligns with changes in test expectations.
Identify the location of the recommended cuts on the scale of the reference assessment to show how much they differ	States may use linking procedures to identify where their 4th-grade proficiency threshold would fall if mapped to the NAEP scale to illustrate the differences in expectations.
Present information about how performance on the reference test relates to performance on the new assessment	States can embed released NAEP items in their operational test to enable direct comparisons of student performance on common items.
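
The first strategy in the table, locating a reference benchmark on the scale of the new assessment, requires some form of linking. One simple possibility, assuming score distributions on both tests are available for a comparable group of students, is an unsmoothed equipercentile-style mapping: find the percentile rank of the benchmark on the reference test, then find the state-test score at that same percentile rank. The Python sketch below shows the idea. The function names and all data are hypothetical, and an operational linking study would typically add smoothing, larger samples, and an evaluation of linking error.

```python
def percentile_rank(scores, value):
    """Percent of scores below `value`, counting half of the scores equal to it."""
    if not scores:
        raise ValueError("scores must not be empty")
    below = sum(1 for s in scores if s < value)
    equal = sum(1 for s in scores if s == value)
    return 100.0 * (below + 0.5 * equal) / len(scores)


def score_at_percentile(scores, pct):
    """Score at percentile `pct`, interpolating linearly between sorted scores."""
    if not scores:
        raise ValueError("scores must not be empty")
    ordered = sorted(scores)
    position = pct / 100.0 * (len(ordered) - 1)
    lower = int(position)
    upper = min(lower + 1, len(ordered) - 1)
    fraction = position - lower
    return ordered[lower] + fraction * (ordered[upper] - ordered[lower])


def link_benchmark(reference_scores, state_scores, benchmark):
    """Estimate where a reference-test benchmark falls on the state-test scale."""
    rank = percentile_rank(reference_scores, benchmark)
    return score_at_percentile(state_scores, rank)


# Hypothetical reference-test and state-test scores, for illustration only.
reference = [14, 17, 19, 21, 22, 24, 26, 29, 31, 33]
state = [410, 432, 455, 470, 488, 501, 515, 530, 552, 580]
print(link_benchmark(reference, state, benchmark=22))
```

The estimate is only as good as the assumption that the two score distributions describe comparable students, which is one reason such a value is better presented to panelists as an approximate region than as a precise point.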

Standard-setting is a judgment-based process. Reference data, when thoughtfully selected and well-integrated, can support that judgment. The strength of a standard-setting process lies in its clarity of purpose, alignment to policy, and fidelity to both evidence and professional judgment. Done well, the inclusion of reference data can enhance all three. Done poorly, it can threaten the validity of the process and the results.

Photo by Allison Shelley for EDUimages