Current Assessments May Require New Equating Methods and Procedures to Ensure Results Will Be Comparable Across Test Forms
One of the fundamental tenets of equating two tests is that it should be a matter of indifference to an examinee which of the two tests they take.
When two or more students receive the same scale score on their state’s Grade 3 reading test, we want to be able to make the same inferences about their performance, even if they responded to almost entirely different sets of items.
Holland and Dorans (2006) labelled this the equity requirement. We refer to it as a tenet rather than a requirement because, while the interchangeability of test forms is an assumption that drives education policy, it has always been known (by the measurement community, if not fully understood by policymakers) that interchangeability is an ideal we strive for but never fully achieve in practice. We can and do, however, apply research and best practices to test design, development, administration, scoring, psychometric analyses, and reporting to ensure that test forms are as close to interchangeable as possible and that the intended interpretations and uses of the test scores they produce are supported.
The Challenges of Measuring Student Performance When the Focus Is Proficiency
When the focus of assessment is the classification of individual student performance into achievement levels based on a single assessment, we cannot claim with absolute certainty that a student would have achieved the same score if she or he had received a test form with a different vocabulary item or one that included a passage about Beyoncé instead of Beethoven. Those differences are just one reason why the field has established standards and endorsed guidelines for the use of assessments for high-stakes decisions for individual students.
Andrew Ho and others have described the problems associated with assessment and accountability policy focused solely on the percentage of students classified as “Proficient” on a state assessment. Large changes in performance from one year to the next might result in small changes or no changes in the percentage of students meeting the “Proficient” cut depending on where the cut is in relation to where most of the students are scoring. Conversely, small changes in the distribution of student performance could lead to a large change in the percentage of students meeting the “Proficient” cut if many of the students are scoring near it.
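This sensitivity can be sketched with a small example. The score distribution here is a hypothetical normal distribution, and the cut scores and means are invented for illustration; the point is only that the same overall improvement moves the percent-"Proficient" figure very differently depending on where the cut sits relative to the bulk of students.

```python
from statistics import NormalDist

def pct_above(cut, mean, sd=10):
    """Percent of students at or above a cut, under a normal score model."""
    return 100 * (1 - NormalDist(mean, sd).cdf(cut))

# The whole distribution improves by 0.2 SD (mean 150 -> 152).
for label, cut in [("cut in the tail", 175), ("cut near the center", 150)]:
    before = pct_above(cut, mean=150)
    after = pct_above(cut, mean=152)
    print(f"{label}: {before:.1f}% -> {after:.1f}% proficient "
          f"(change {after - before:+.1f} points)")
```

The identical shift in performance produces a change of under half a point when the cut is in the tail, but a change of several points when most students score near the cut.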
The issue of a distribution of students centered near the “Proficient” cut is exacerbated when the equating process results in gaps in the scale scores that can be achieved on a particular test form. Most test users are familiar with scale score conversion tables, such as the one below, in which on Form B two raw scores are associated with a single scale score of 199 and no raw score is associated with a scale score of 200.
An illustrative conversion table consistent with this description (the raw-score values are hypothetical):

| Raw score | Form A scale score | Form B scale score |
|-----------|--------------------|--------------------|
| 45        | 198                | 198                |
| 46        | 199                | 199                |
| 47        | 200                | 199                |
| 48        | 201                | 201                |
In this example, Form A and Form B have been carefully constructed to be equivalent in difficulty, and equating procedures have accounted for any minor differences in difficulty that do exist. As we would expect, there would be virtually no difference in mean scale score if either form were randomly assigned to students in a state.
Similarly, there would be no difference in the percentage of students classified as “Proficient” if the cut were 198, 199, or 201. There would, however, be a 5-point difference in the percentage of students classified as “Proficient” on Form A and Form B if the cut were 200.
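A small sketch reproduces the effect. The raw scores and the raw-score distribution below are hypothetical, chosen to match the description above: Form B maps two raw scores to 199 and none to 200, and 5% of students sit at the raw score inside the gap.

```python
# Hypothetical raw-to-scale conversions near the "Proficient" cut.
form_a = {44: 197, 45: 198, 46: 199, 47: 200, 48: 201, 49: 202}
form_b = {44: 197, 45: 198, 46: 199, 47: 199, 48: 201, 49: 202}

# Illustrative percent of students at each raw score (sums to 100);
# the same distribution applies to both forms, as with random assignment.
dist = {44: 20, 45: 25, 46: 25, 47: 5, 48: 15, 49: 10}

def pct_proficient(table, cut):
    """Percent of students whose scale score meets or exceeds the cut."""
    return sum(p for raw, p in dist.items() if table[raw] >= cut)

for cut in (198, 199, 200, 201):
    a, b = pct_proficient(form_a, cut), pct_proficient(form_b, cut)
    print(f"cut={cut}: Form A {a}%, Form B {b}%, difference {a - b}")
```

The two forms agree exactly for cuts of 198, 199, and 201, but differ by 5 points when the cut is 200, because students at raw score 47 fall just above the cut on Form A and just below it on Form B.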
Adapting Approaches to Equating to Meet New Assessment Demands
The example above is one illustration of how the assessment and accountability requirements of No Child Left Behind put new demands and strains on our best practices in equating test forms. The need to equate multiple forms of the same test within a year, and then equate that set of forms with another set of forms across years to make comparable achievement level classifications on each of those forms was a formidable challenge that we were able to meet well in most cases.
It was important, however, that we understood and were able to explain cases such as the one described here in which equating alone was not sufficient to ensure that it was a matter of indifference which test form was administered.
We are moving rapidly into a next generation of large-scale assessment in which there are new challenges that we must overcome or at least understand:
- The shift from paper-based to computer-based testing (including testing across different platforms and device types) resulted in differences at some grade levels and on some tests that cannot be accounted for simply through equating.
- As more states use college admissions tests as high school state assessments, careful consideration is needed to determine whether established practices for equating alternate test forms are still appropriate when state results used for accountability are based largely on a single form.
- The increased availability of long, short, and other alternate forms of assessments to address concerns about testing time and costs requires a renewed focus on established knowledge about linking test forms.
- A renewed interest in matrix sampling and performance-based assessments to measure complex standards such as the Next Generation Science Standards revives challenges that were set aside during the No Child Left Behind era, such as field testing complex performance assessments and applying the psychometric procedures appropriate to support both group-level and individual results.
Perhaps most importantly, computer-adaptive testing and the use of items licensed from existing pools to build multiple pre-equated forms administered on demand are becoming the norm. We must keep our guard up against being lulled into the belief that test equating is automatically ensured through item calibration.
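One way to see why calibration alone is not enough: under a two-parameter logistic (2PL) IRT model, two forms assembled from the same calibrated pool, with every item parameter on a common theta scale, can still have quite different test characteristic curves. The item parameters below are invented for this sketch; Form 2 simply draws harder items than Form 1.

```python
import math

# Hypothetical 2PL item parameters (a = discrimination, b = difficulty),
# all "pre-equated" in the sense of sitting on one common theta scale.
form_1 = [(1.0, -0.5), (1.2, 0.0), (0.8, 0.5), (1.1, 1.0)]
form_2 = [(1.0, 0.0), (1.2, 0.5), (0.8, 1.0), (1.1, 1.5)]

def expected_score(items, theta):
    """Test characteristic curve: expected raw score at ability theta."""
    return sum(1 / (1 + math.exp(-1.7 * a * (theta - b))) for a, b in items)

for theta in (-1.0, 0.0, 1.0):
    e1, e2 = expected_score(form_1, theta), expected_score(form_2, theta)
    print(f"theta={theta:+.1f}: Form 1 expects {e1:.2f}, Form 2 expects {e2:.2f}")
```

At every ability level, Form 1 yields a higher expected raw score than Form 2, even though both forms were built from one calibrated pool; without attention to form assembly and scoring, "calibrated" does not mean "interchangeable."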
We must avoid the temptation to treat test equating itself as a matter of indifference.