What Can Professional Football Teach Us About the Responsible Use of Tests?

Practical Advice to Avoid Measurement Fumbles

If those of us in educational measurement had a nickel for every time we said, “Don’t rely on single test scores to make important decisions,” we could all retire wealthy. The advice matters most when reliance on a single score leads users of assessment data to ignore other important information about the test taker.

We’re not alone. The Wall Street Journal carried a story in January about the S2 Cognition Test. Roughly half the teams in the National Football League have used the S2 for about seven years. It has certainly helped identify potential future stars, such as San Francisco quarterback Brock Purdy, who undoubtedly would not have been picked had he not scored as well as he did.

But that success has led many NFL general managers to rely primarily on the test score when evaluating players, even when their eyes were telling them something different.

The Journal story focused on rookie Houston Texans quarterback C.J. Stroud. Stroud had one of the best seasons of all NFL quarterbacks as a rookie in 2023. But he scored quite poorly on the S2. To be fair, S2 labeled his test “potentially invalid.”

It’s not as if Stroud came out of nowhere. He was a three-year superstar at Ohio State, so scouts had a good body of evidence about his abilities. Nevertheless, instead of being the top draft pick, Stroud was picked second by the Texans. Thankfully, the Texans’ front office looked at Stroud’s full record in deciding to keep him as a top pick. While being picked second is still impressive, there is often quite a difference in salary between the #1 and #2 draft picks, so Stroud’s low score on the S2 likely cost him some money. That said, his amazing performance this year casts doubt on the credibility of the S2 as a rock-solid indicator of future NFL success.

Perhaps this sounds familiar. In education, we don’t have to look far to see many instances where using test scores for important decisions is contentious. In this blog post, we’ll highlight some key measurement lessons drawn from the article and from current—and longstanding—issues in educational assessment. The italicized statements that follow are drawn from the Journal article.

The Reification of Test Scores

“The debate over the S2’s value also strikes at the core of one of the most lucrative questions in professional sports. In an industry increasingly flooded with data, a test that can precisely calculate athletes’ capabilities is a panacea. But it isn’t yet a reality.”

Stephen Jay Gould famously wrote about the reification of IQ scores in The Mismeasure of Man. He described how early educational psychologists were so excited about their newfound success in measuring some aspects of cognitive ability that they believed they were measuring everything important to future success.

This belief is related to people’s seemingly unquestioning faith in numbers. While there are many quantities involved in judging potential football talent, such as 40-yard-dash speed, the judgment also involves observations of less easily quantified traits, such as the capacity to perform under pressure, situational awareness, and the ability to motivate teammates. But too often, quantities associated with S2 performance appear to be more valued than these important qualitative observations.

Of course, this picture is uncomfortably similar to what we do in education too often. Case in point: teachers’ expert observations of students are often given less importance than “official” state or interim test scores.

What Are We Measuring? Clearly Defining the Construct

“But [Brandon] Ally [one of the S2 developers] cautions that success in the NFL doesn’t rest entirely on a player’s cognitive abilities. It’s just one factor in a much larger equation that includes everything from physical skills to mental toughness. S2 isn’t responsible for measuring any of those.”

Despite this caution, the S2 website makes many strong claims about its utility for predicting an athlete’s likely success. The qualities that define a successful NFL quarterback are multifaceted, and successful quarterbacks vary in their expertise in these multiple dimensions.

For example, Tom Brady was known for his ability to read the defense and release the ball extremely quickly, while Patrick Mahomes is able to buy time with his uncanny skill at escaping would-be tacklers. Both are widely considered elite quarterbacks. The point is that there are many qualities (or constructs in measurement parlance) that define effective quarterbacks. Relying on any subset of qualities underrepresents the construct and can mislead users into thinking they have more information than they really do.

Measurement Error is Always Lurking

“Let’s say we miss 20% of the time. If our standard has to be we can’t miss ever, or we can’t miss on one player, man, that’s tough,” Ally says. “I don’t know anybody in sports who’s that good.”

This statement is at the heart of what it means to be a psychometrician. We spend our lives trying to quantify uncertainty because all tests contain measurement error. This uncertainty can be related to the context or circumstances of the test-taker, who may be tired or distracted (as noted in the Journal piece about why Stroud’s S2 score may have been “invalid”). Uncertainty can be related to the measure itself since tests contain only a sample of the full range of questions or tasks that matter. And—brace yourself—some of the questions may be imperfect. Understanding measurement error helps users avoid overinterpreting relatively small differences in scores.
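One standard tool for quantifying this uncertainty is the standard error of measurement (SEM), computed from a test’s score spread and reliability. The sketch below uses invented figures (a scale with SD 15 and reliability 0.90) purely for illustration; it is not based on S2’s actual statistics.

```python
import math

def standard_error_of_measurement(sd: float, reliability: float) -> float:
    """SEM = SD * sqrt(1 - reliability): the typical size of the random
    error attached to any single observed score."""
    return sd * math.sqrt(1.0 - reliability)

def score_band(observed: float, sem: float, z: float = 1.96) -> tuple:
    """An approximate 95% confidence band around one observed score."""
    return observed - z * sem, observed + z * sem

# Hypothetical test: scores scaled to SD 15, reliability 0.90.
sem = standard_error_of_measurement(sd=15, reliability=0.90)
low, high = score_band(observed=92, sem=sem)
print(f"SEM = {sem:.1f}; a score of 92 plausibly reflects {low:.0f}-{high:.0f}")
```

On these assumed figures, a single observed score of 92 is consistent with a “true” score anywhere from the low 80s to just over 100, which is exactly why small score differences should not be overinterpreted.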

Misuse of Test Data in Education

Treating test scores as the “thing” we care about rather than as the proxies they are (reification), construct underrepresentation, and a lack of appreciation for measurement error play out in many current educational measurement issues. The current debates about the use of test scores for college and elite high school admissions involve all three of these critical measurement concepts. Discussions of admissions testing rightfully address the degree to which these tests promote or hinder justice and diversity. These issues are important to the validation of tests for admissions purposes, but we are focusing here on these three fundamental measurement concepts.

If admissions officers at elite universities were to use specific scores on a test, especially relatively high scores, as the sole determinant of whether students are even eligible for consideration, we would question whether they are falling into the reification trap. Understanding measurement error would help admissions officers recognize that there is essentially no difference between scores within striking distance of the cutscore, no matter on which side the observed score may fall.
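That claim about scores near a cutscore can be made concrete. When comparing two observed scores, each carrying independent error of size SEM, the error of their difference is SEM multiplied by the square root of two, so only gaps larger than roughly 1.96 × SEM × √2 are statistically distinguishable. The SEM and cutscore values below are hypothetical, chosen only to illustrate the point.

```python
import math

def scores_distinguishable(score_a: float, score_b: float,
                           sem: float, z: float = 1.96) -> bool:
    """Two observed scores differ reliably only if their gap exceeds
    z * SEM * sqrt(2), the critical value for the difference between
    two scores that each carry independent error of size SEM."""
    return abs(score_a - score_b) > z * sem * math.sqrt(2)

# Hypothetical admissions test with SEM = 30 scale-score points
# and a cutscore of 1200: applicants at 1210 and 1190 straddle
# the cut, yet their 20-point gap is well inside the error band.
print(scores_distinguishable(1210, 1190, sem=30))
```

With an assumed SEM of 30, any gap under about 83 points is noise, so treating the applicant at 1210 as categorically different from the one at 1190 is exactly the reification trap described above.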

We believe, or at least hope, that when ACT or SAT scores are used in admissions decisions, they’re not used as a single indicator but rather as part of a body of evidence that can present a more complete picture and a better prediction of a student’s likelihood of success in college. Using multiple indicators, particularly ones designed to measure different aspects of postsecondary preparation, can offset challenges associated with construct underrepresentation.

The Role of Time in Measurement

We started this post by describing the discrepancy between C.J. Stroud’s on-the-field performance and his S2 score. But we need to remember that as outstanding as Stroud’s performance was this year, it is still only one year. He could easily fall into a sophomore slump or become a one-year sensation. Bryce Young (the #1 draft pick) could still turn out to be the better NFL quarterback.

Like player performance, student proficiency is not a static construct but changes over time and in different ways. This is all the more reason why test users must avoid these three fatal flaws of measurement and recognize the utility and fallibility of test scores.
