Comparability of Scores on the Same Test

In 2018, the Center was honored to be invited by the National Academy of Education to contribute two chapters to the book Comparability of Large-Scale Educational Assessments, which was released earlier this year. The Center’s chapters addressed the foundational issues surrounding the comparability of individual and aggregate group scores when students ostensibly take the same test. Charlie DePascale and Brian Gong authored the chapter, Comparability of Individual Students’ Scores on the “Same Test.” Leslie Keng and Scott Marion, with contributions from Susan Lyons, authored the chapter, Comparability of Aggregated Group Scores on the “Same Test.”  In this post, we share a summary of the chapters presented by Brian and Leslie at the virtual 2020 conference of the National Council on Measurement in Education.

When we administer K-12 large-scale assessments, we expect to make comparisons among scores. For individual student scores, we most often compare the performance of an individual student to a fixed standard, such as an achievement level cut score or a passing score.  Other common comparisons include comparing the performance of two or more individual students or comparing the performance of the same student at two points in time. For groups such as districts and schools, the comparisons we make generally fall into four major categories:

  1. Monitor population trends and patterns.
  2. Compare subgroup performance at specific time points and over time.
  3. Evaluate curriculum, instruction, and interventions or other programs.
  4. Make accountability decisions regarding districts, schools, or teachers.

The discussion of comparability of individual student scores on large-scale assessments such as K-12 state assessments begins with the interpretation of a single student’s score. It is expected that a student would receive the same score, supporting the same interpretation or inferences, if they took a different form of the test and/or took the test under permissibly different conditions. That is, individual student scores on the “same test” are expected to be interchangeable. Fundamentally, comparability is a question of validity, reliability/precision, and fairness.

What Do We Mean By the “Same Test”?

There are obvious challenges in trying to make comparisons of scores from different tests such as scores on the ACT and SAT or scores from a commercial interim assessment and a state summative assessment.  Comparing scores on the same test would appear to be a much more straightforward task. In K-12 large-scale assessment, however, it is rare to have the same set of items administered to all students at the same time, under the same testing conditions. The term “same test” can refer to a wide variety of configurations of test items and testing conditions.  In the table below, we list test forms that would generally be regarded as the same test as the state’s 2019 Grade 8 Reading test as well as some examples of test forms that fall in a gray area of being the same test and test forms that are definitely not the same test. 

Designing, administering, scoring, and reporting results of test forms intended to be the same test and produce comparable results requires a thoughtful combination of design decisions and psychometric procedures. Attaining comparability, at both the individual and group levels, requires a deep understanding of the relationships among the specific learning target(s) being measured (i.e., the construct), the standards, and the assessment. That understanding is needed to judge which deviations from the “same test” are likely to affect the measurement of the construct and the comparability of student scores (e.g., non-standardization in test specifications, timing, use of accommodations, and scoring and equating procedures). It is not possible to attain comparable scores through design decisions alone or through psychometrics alone.
 

Threats to Comparability

Even when it appears that a testing program has taken all of the steps necessary to build, administer, and score test forms that will produce comparable results, there are still significant threats to comparability that must be considered. Common threats to comparability in K-12 state assessment programs and examples of each type include:

  • Altered purposes: The state begins to use (or stops using) the test for teacher accountability or student accountability.
  • Altered design specifications: To reduce cost or testing time, the state replaces an essay or performance task with selected-response items. The state introduces new item types or tools unfamiliar to students.
  • Non-“standardized” test administration: Over time, the variety of devices on which students take the test continues to expand.
  • Change in assessment contractor: The selection of a new assessment contractor results in subtle or major changes to the test delivery platform, scoring procedures, test security, or psychometrics.

One threat to comparability that is particularly relevant at this time is students’ opportunity to learn the material that is tested.  Opportunity to Learn (OTL) comprises a variety of factors related to the quantity and quality of student access to appropriate curriculum and instruction.  Although there is general agreement that OTL affects student achievement, its impact on the comparability of scores is dependent on the claims that are being made about that student achievement.  For example, even in the case where students clearly had disparate OTL, their test scores might still be considered comparable if the claim is that the test score describes the student’s current level of achievement in the content area.  However, the same test scores obtained under the same conditions would not be comparable for the following claims:

  • The test score reflects the achievement of students after they have received instruction in the content area being tested.
  • The test score reflects the level of achievement students can attain if they have had an adequate opportunity to learn the material.
  • The test score reflects what students could achieve at the next grade level or in college if provided an adequate opportunity to learn. 

An additional threat to the comparability of aggregated group scores over time is a significant change to the sample or population tested, which results in changes to the composition of the groups being compared. The example presented in the two tables below demonstrates how an apparent improvement in school scores from one year to the next may be due to changes in the school’s enrollment rather than changes in school performance.
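The composition effect described above can be illustrated with a minimal sketch. The subgroup labels, enrollment counts, and scale scores below are hypothetical values chosen for illustration; they are not the figures from the chapter. In this sketch, each subgroup's mean score is identical in both years, yet the school's overall mean rises because the enrollment mix shifts toward the higher-scoring subgroup.

```python
def weighted_mean(groups):
    """Return the enrollment-weighted mean score for a school.

    `groups` maps a subgroup label to (number_of_students, mean_score).
    """
    total_students = sum(n for n, _ in groups.values())
    total_points = sum(n * mean for n, mean in groups.values())
    return total_points / total_students


# Hypothetical data: subgroup means are held constant across years;
# only the number of students in each subgroup changes.
year1 = {"group_a": (60, 240.0), "group_b": (40, 260.0)}
year2 = {"group_a": (40, 240.0), "group_b": (60, 260.0)}

school_mean_y1 = weighted_mean(year1)
school_mean_y2 = weighted_mean(year2)

# True when no subgroup's mean score changed between the two years.
subgroup_means_unchanged = all(
    year1[label][1] == year2[label][1] for label in year1
)

print(f"Year 1 school mean: {school_mean_y1:.1f}")
print(f"Year 2 school mean: {school_mean_y2:.1f}")
print(f"Subgroup means unchanged: {subgroup_means_unchanged}")
```

Under these assumed numbers the school mean moves from 248 to 252 even though neither subgroup performed any differently, which is why a change in aggregate scores should be checked against changes in who was tested before it is interpreted as a change in performance.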

Evaluating Whether There is Sufficient Comparability

The level of comparability needed in individual or aggregate test scores varies based on the claims being made and the stakes associated with those claims. It is a matter of policy and professional judgment. As stated by the National Research Council Committee on Equivalency and Linkage of Educational Tests,

“[U]ltimately, policy makers and educators must take responsibility for determining the degree to which they can tolerate imprecision in testing and linking… and responsible people may reach different conclusions about the minimally acceptable level of precision in linkages that are intended to serve various goals.” (NRC Committee on Equivalency and Linkage of Educational Tests, 1999, p. 4)

Professional standards, tools, criteria, and traditions of professional practice exist to help test developers, test users, policymakers, and educators make informed decisions regarding test score comparability.  As a starting point for states planning a K-12 large-scale assessment program, we offer the following Framework for Supporting Comparability Claims.


Summary

Designing for and evaluating the comparability of test scores is central to the Center’s work with large-scale testing and accountability systems. This volume will be an important reference for testing providers, state leaders, and others concerned about the comparability of educational test scores.

Digital copies of individual chapters or the entire book, Comparability of Large-Scale Educational Assessments: Issues and Recommendations (Eds. A. Berman, E. Haertel, & J. Pellegrino), may be downloaded at no charge from the National Academy of Education website.
