A Framework for Reporting Technically-Sound and Useful Subscores on State Assessments
This is the fifth in a series of CenterLine posts by our 2019 summer interns and their Center mentors, based on the interns’ projects and the assessment and accountability issues they addressed this summer. Victoria Tanaka, from the University of Georgia, worked with Chris Domaleski on a review of the reporting of subscores on states’ large-scale assessments.
Many assessment programs report both a total score and subscores on their tests. While the total score provides an indication of how well the student performed overall, subscores are intended to provide specific information about skills measured by subsets of items.
Stakeholders generally like subscores because they ostensibly provide a more detailed portrayal of student performance. However, subscores often aren’t very clear, precise, or reliable, which can create tension between reporting assessment results that are technically defensible and giving test users the detailed information thought to inform educational policy and decisions.
To help address this tension, I reviewed the range of ways subscores are reported today, with the goal of developing a path toward improved subscore reporting, use, and effectiveness.
Taking a Closer Look at Subscore Reporting Practices
To better understand the range of current practices in reporting subscores, I started by collecting information about subscore reporting from state testing programs. I examined Grades 3-8 separately from high school (Grades 9-12), because students take different summative assessments in high school than in earlier grades.
For each state and the District of Columbia, I retrieved sample individual student score reports that are representative of the state’s large-scale summative assessment for English language arts (ELA) and mathematics. I examined an equal number of reports for Grades 3-8 and high school, though I did not retrieve a report for every grade level.
The information collected is organized according to the following taxonomy: metric, uncertainty, and interpretation.
Generally, four metrics are used in practice to communicate subscores, as displayed in the graph below: raw score, percent correct, performance categories, and scale scores. Some states report subscores using more than one metric.
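To make the difference between these metrics concrete, here is a minimal Python sketch that derives a raw score, a percent correct, and a performance category from one student's item scores on a single subscale. The function name, the cut points, and the category labels are hypothetical and chosen only for illustration; they do not come from any state's reports. Scale scores are left out because they depend on a program-specific scaling procedure.

```python
from bisect import bisect_right


def subscore_metrics(item_points, max_points, cut_points, labels):
    """Summarize one student's subscale performance in three metrics.

    item_points: points earned on each item in the subscale
    max_points:  total points available on the subscale
    cut_points:  ascending raw-score thresholds (hypothetical)
    labels:      category names, one more than the number of cut points
    """
    raw = sum(item_points)
    percent = 100.0 * raw / max_points
    # bisect_right counts how many cut points the raw score has reached.
    category = labels[bisect_right(cut_points, raw)]
    return {
        "raw_score": raw,
        "percent_correct": percent,
        "performance_category": category,
    }


# Hypothetical eight-item subscale scored 0/1, with made-up cut points.
result = subscore_metrics(
    item_points=[1, 0, 1, 1, 1, 0, 1, 1],
    max_points=8,
    cut_points=[4, 7],
    labels=["Below", "Near", "Above"],
)
print(result)
```

The same six raw points look quite different depending on the metric chosen, which is one reason the choice of metric matters for how a subscore is read.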
With respect to uncertainty, or measurement error, I found that most states do not provide this information on score reports, as displayed in the graph below. States that do report subscore error do so either through a graphical representation, such as error bars, or through a qualitative description, which might take the form of a few sentences in the individual student report or score interpretation guide.
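As a rough illustration of what an error bar on a subscore might represent, the sketch below uses the classical test theory formula for the standard error of measurement, SEM = SD × sqrt(1 − reliability). This is an assumption made for illustration: the reports reviewed do not say how each state computes its error bars, and the standard deviation and reliability values here are invented.

```python
import math


def standard_error_of_measurement(score_sd, reliability):
    """Classical test theory SEM: SD * sqrt(1 - reliability)."""
    if not 0.0 <= reliability <= 1.0:
        raise ValueError("reliability must be between 0 and 1")
    return score_sd * math.sqrt(1.0 - reliability)


def error_band(score, sem, width=1.0):
    """Return the (lower, upper) limits of a score +/- width * SEM band."""
    return (score - width * sem, score + width * sem)


# Hypothetical subscale: SD of 4.0 points and reliability of 0.75.
sem = standard_error_of_measurement(score_sd=4.0, reliability=0.75)
print(sem)                    # 2.0
print(error_band(12.0, sem))  # (10.0, 14.0)
```

With these made-up values, a reported subscore of 12 would carry a band from 10 to 14, which gives a sense of how wide the uncertainty around a short subscale can be.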
Interpretation and Use
Finally, I examined how states describe potential subscore interpretations. The prominent methods were:
- Classification with respect to state achievement standards. Subscores relate student performance to state expectations, such as state achievement levels (e.g., Basic, Proficient, Advanced).
- Classification with respect to an external criterion. Subscores relate student performance to expectations beyond those of the state. For example, the score indicates whether the student has met a benchmark for college and career readiness.
- Comparison of relative performance. Subscores relate student performance on the subset of items to the individual student’s overall test performance, most often to identify areas of strengths and weaknesses.
A Framework for Selecting Defensible Subscore Reporting Practices
A final and central part of this project was to provide a framework for reporting subscores with a strong evidence base. There are three primary components to the framework: interpretation, evidence, and communication. Interpretation refers to the three primary intended subscore interpretations identified in the taxonomy described above:
- classification with respect to performance standards,
- classification with respect to external benchmarks or norms, and
- comparison of relative performance.
The communication methods are also described in the taxonomy:
- the type of metric used,
- reported uncertainty, and
- any additional qualitative information provided.
Importantly, the framework also includes the evidence necessary to support the intended interpretations and uses.
First, there are three “cross-cutting” pieces of evidence that are important for all of the interpretations discussed:
- The measure is sufficiently reliable or precise. This requirement comes from both the 2014 joint Standards for Educational and Psychological Testing and the Assessment Peer Review Guidance. Evidence might include measures of internal consistency or other reliability studies.
- The structure is distinct and meaningful. The subscales must represent distinct, meaningful scales to produce useful subscores. The test developer should demonstrate the uniqueness of the information provided by the subscales. Evidence might include a factor analysis.
- There is sufficient content representation to support the interpretation. As with the total test, each subscale should adequately sample the content it is intended to measure. Evidence might include detailed blueprints for the subscales or expert review of the subscales for range and depth of alignment.
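One common measure of internal consistency is Cronbach's alpha. The sketch below computes it for a single subscale from a small, made-up matrix of item scores (one row per student, one column per item). It is only an example of the kind of reliability evidence mentioned above, not a method prescribed by the framework, and the data are hypothetical.

```python
from statistics import variance


def cronbach_alpha(item_scores):
    """Cronbach's alpha for a subscale.

    item_scores: list of rows, one per student; each row holds that
    student's score on every item in the subscale.
    """
    n_items = len(item_scores[0])
    if n_items < 2:
        raise ValueError("alpha needs at least two items")
    # Sample variance of each item across students.
    item_variances = [
        variance([row[i] for row in item_scores]) for i in range(n_items)
    ]
    # Sample variance of students' total subscale scores.
    total_variance = variance([sum(row) for row in item_scores])
    return (n_items / (n_items - 1)) * (1 - sum(item_variances) / total_variance)


# Hypothetical responses: five students, four items scored 0/1.
responses = [
    [1, 1, 1, 1],
    [1, 1, 1, 0],
    [1, 1, 0, 0],
    [1, 0, 0, 0],
    [0, 0, 0, 0],
]
print(round(cronbach_alpha(responses), 3))  # 0.8
```

Subscales usually contain far fewer items than the total test, so a coefficient computed this way is worth checking for each subscale separately rather than assuming the total-test reliability carries over.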
Next, I propose some additional evidentiary requirements that pertain to some specific interpretations. When the subscores are intended to classify students, either with respect to performance standards or an external benchmark, the following evidence must be provided:
- The classifications meaningfully differentiate performance. When subscores are intended to help classify students, it is essential that they can meaningfully differentiate between high and low performers. Evidence of this differentiation might include a classification consistency study or demonstrated measurement precision at each performance level. For example, consider the conditional standard error of measurement: if the error bands around scores at different performance levels overlap, the apparent difference between those levels may represent noise.
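One simple way to look at this is to check how often a student's error band straddles a cut score. The sketch below flags a classification as ambiguous when the band of score ± one conditional standard error of measurement (CSEM) contains the cut. The scores, CSEM values, and cut score are hypothetical, and this check is a rough screen rather than a substitute for a full classification consistency study.

```python
def band_crosses_cut(score, csem, cut, width=1.0):
    """True if the score +/- width * CSEM band includes values on both
    sides of the cut, where scores at or above the cut count as meeting it."""
    lower = score - width * csem
    upper = score + width * csem
    return lower < cut <= upper


# Hypothetical (subscore, CSEM) pairs and a made-up cut score of 10.
students = [(6.0, 1.5), (9.0, 1.5), (11.0, 2.0), (15.0, 1.0)]
cut_score = 10.0

flags = [band_crosses_cut(score, csem, cut_score) for score, csem in students]
print(flags)                    # [False, True, True, False]
print(sum(flags) / len(flags))  # 0.5
```

In this invented example, half of the students could plausibly fall on either side of the cut, which would suggest the subscore is not precise enough to support that classification.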
Finally, when the subscore is reported for the purposes of classification with respect to an external benchmark or norm, the test developer must provide one additional piece of evidence:
- There is a demonstrable relationship to the external benchmark. For an interpretation against an external benchmark to be reasonable, the subscores should show an empirical or content-based relationship to that benchmark. Evidence might include a prediction study with composite and/or additional variables or outcomes (such as specific benchmarks), or expert analysis of subscore content with respect to the claims covered by the subscales.
The full paper we are producing for this project will include several examples to illustrate the utility of the proposed framework and to demonstrate some suggested appropriate (and inappropriate) subscore reporting practices.
Overall, the goal of the project was to help test developers and users better understand how to improve the clarity, utility, and credibility of subscore reporting practices. By considering the range of practices described in the taxonomy and using the guide provided by the framework, test developers and state policymakers can build a solid body of evidence to more effectively support subscore reporting and use.