Changing Expectations for Fairness in Testing

Where we’ve been and where we’re going

Do you ever encounter a test, see the questions or the score report, and then scratch your head and ask, “How can anyone consider this test to be fair?”

Maybe two students who typically show similar performance on classroom tests get wildly different scores on their end-of-year state test. Maybe a college-entrance practice question strikes you as wrong or misleading. As with any refereed game, we want to believe that scores are trustworthy and provide a solid assessment of student performance.

As testing professionals, we often accept “technically sound tests” as “objective measures” without much question, even though we understand that test development processes can introduce threats to fairness. We know that test items can be biased, that tests can raise unfair barriers to students’ demonstrating what they know, and that tests can be misaligned with what students are learning.

Looking to Standards of Practice

The testing industry has evolved in how it defines, understands and evaluates fairness. The Standards for Educational and Psychological Testing are the field’s north star for guidance on testing. They direct developers to ensure quality work by explicating best practice based on theory and research.

These Standards matter because they guide consequential decision-making for test developers in the creation, administration and evaluation of tests (such as your state’s accountability tests, the SAT or the ACT). The Standards have even influenced the assessment provisions of federal and state laws, including the latest two reauthorizations of the Elementary and Secondary Education Act.

Where We’ve Been Going

A revision of the current Standards is under way. But to gauge how the field has evolved on fairness, I compared the last two versions to see what they said. This exercise highlighted a few important points about our trajectory over time. It also helped me formulate some thoughts about how to better ensure that tests are fair.

The 1999 version of the Standards defines fairness in testing in terms of what was feasible for test developers. It frames fairness as rooted in aspects of “value and public policy [that] are crucial to responsible test use” (p. 80). It goes on to say that there is “limited or no consensus” about what fairness in testing actually is, and therefore the issue of fairness is primarily outside test developers’ scope of work.

The 1999 version assigns limited responsibility to test developers for test fairness beyond the “customary” technical expectations of psychometric design and execution (e.g., reliability, comparability, evaluation, documentation). This is like saying, “We build our product and other people must decide if it is fair for their purposes.”

Importantly, the 2014 version of the Standards expands test developers’ responsibility for making fair tests. It includes fairness as a foundational topic, along with validity and reliability. This edition defines a fair test in three primary ways:

All test takers should be tested on the same constructs. The test construct is the concept or characteristic that a test is designed to measure. For example, if we are testing math, we don’t want to inadvertently make the test so language-heavy that we are actually testing reading.
Tests should produce scores that mean the same thing for all test takers. Again, we don’t want to test reading on a math test because doing so will unfairly affect specific groups, like students with learning disabilities or English learners.
Tests should not help or hinder test takers because of personal characteristics that are irrelevant to the intended construct. In other words, tests should present no barriers that specifically affect a subset of test takers. For instance, heavy language demands on a math test can present unfair barriers to some students and not to others.

You don’t have to be a testing professional to see that the definitions of fairness in testing evolved significantly between the 1999 and 2014 versions of the Standards. Most importantly, the 2014 version defines fairness in testing as reaching beyond customary psychometric considerations, and it assigns additional responsibility to test developers for ensuring that tests are fair.

The Next Chapter: Fairness in Testing

Since the Standards are so influential and a new version is under construction, this comparative exercise led me to think about how the new Standards could continue the trajectory toward an even deeper understanding of what constitutes a fair test.

Here are four recommendations:

Test developers must be responsible for transparent fairness arguments. If we claim a test is fair, we need to show it. For example, if we say the test produces a score that allows us to compare reading performance across a group of students, we’d better be able to defend that claim with evidence of fairness. As they do with validity arguments, developers should answer questions like these: How do we know a test is fair to all test takers? How do we know that test takers and score users understand the meaning and intended interpretations of scores?

Test developers should provide transparent disclosures of what test scores communicate. Arguing that a test is fair without clarifying what the scores mean creates an opportunity for the inadvertent – or even willful – use of scores for unfair purposes. Test developers also need to disclose any missing evidence of fairness so that score users are aware of any risks.

Test developers should offer explicit evidence of how psychometric evaluation operates in service of fairness, not just of validity. This can be tricky, since validity is usually understood through evaluations of prior scores or factors that predict future performance, whereas fairness evidence may address how these same predictions were interrupted or changed for students over time. But all of this evidence is important.

Fairness should be seen as the subject of ongoing evaluation. This is particularly important to me. We need to acknowledge that we don’t know what we don’t know. We need to include broad and representative groups of people in the test development process so that we bring the best collective understandings to the table. We need to allow for discussion and inspection of our fairness arguments and promote ongoing improvement. We need to dig into questions of fairness, always seeking new ways to evaluate our work. And we need to envision analytical approaches and develop innovative methodologies that identify areas for improvement.

The time is ripe for lively dialogue about fairness in testing. Researchers and practitioners—as well as policymakers and the general public—have the role and responsibility to ensure fair tests for all students. We need to work together to do so.

Anne H. Davidson is the founder of CrescendoEd LLC.