Measuring Durable Competencies: Right-Sizing Expectations

Apr 29, 2026

We risk asking more of measurement than current evidence supports

Pressure to measure durable or “non-academic” competencies is surging. States and districts are exploring new ways to document that graduates of K-12 schools are ready for college, careers, and civic life, in part by adopting “Portrait of a Graduate” frameworks or similar approaches to communicating the broad set of outcomes schools are expected to promote. Measuring competencies like collaboration, self-management, and perspective-taking can signal their importance and inform decisions about instruction, improvement, and accountability.

This renewed attention reflects a longstanding recognition that these competencies matter for students’ postsecondary success. But it also resurfaces a persistent challenge: measuring these competencies in ways that recognize the limits of what the resulting data can actually tell us. As our interest in measuring these competencies grows, so does our risk of asking more of measurement than the current evidence can support.

For state and local education agency leaders, this moment raises a set of practical questions:

  • What interpretations can we actually derive from currently available measures? 
  • How should we use scores on these measures to inform decisions, particularly when they appear precise but reflect a good deal of uncertainty? 
  • How can we align our measurement approaches with the reasons we’re collecting the data in the first place?

Several implications follow from these questions. First, expectations for what these measures can do may need to be recalibrated. Second, interpretation and use are often highly context-dependent, complicating comparisons across schools or student groups. Third, measurement approaches need to be designed and validated for their purpose, whether that purpose is local improvement, research, or system-level monitoring. Finally, results are often best understood as signals that invite further inquiry rather than as definitive indicators.

In my closet is a T-shirt that sums up these implications. It says “Don’t Overinterpret the Data.” It was a gift from an official in a state where I served on a technical advisory committee. Although I don’t wear it frequently, its message offers a useful reminder: even well-designed measures require careful and context-sensitive interpretation.

A recent chapter I wrote with Jim Soland, published in the AEFP Live Handbook of Education Policy Research, reinforced these implications. We examined emerging approaches to measuring durable competencies and offered guidance for responsible use of these measures. The review highlighted both promising developments, including some enabled by advances in technology, and persistent constraints that shape how these measures can be used responsibly.

Limitations of Current Durable Competency Measures

One of the most consistent patterns we found across currently available measures is their reliance on self-reporting. Questionnaires that ask students to rate their own behaviors, attitudes, or mindsets are widely used because they are relatively quick and inexpensive to administer and score. They can be especially helpful for understanding students’ beliefs and perceptions about their own competencies.

But self-report measures introduce validity concerns. Students may interpret items differently, respond in socially desirable ways, or anchor their responses to local norms rather than an external standard. This last issue, often referred to as reference bias, reflects individuals’ tendency to compare themselves with others when evaluating their own skills. In one study, students reported lower levels of self-regulation when surrounded by higher-achieving peers, limiting the usefulness of the measure for predicting later outcomes. In these cases, differences in scores may reflect differences in interpretation as much as differences in the competencies themselves.
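To make the mechanism concrete, here is a minimal sketch of how reference bias could distort comparisons. The two hypothetical schools, the anchoring strength, and all of the numbers are illustrative assumptions, not values from the study mentioned above; the point is only that identical underlying competencies can yield different self-reports when students rate themselves against different peer norms.

```python
# Illustrative simulation of reference bias (hypothetical numbers, not from any cited study).
# Two schools have identical "true" self-regulation, but students anchor their self-ratings
# to local peer norms, so the school with higher-achieving peers self-reports lower scores.
import numpy as np

rng = np.random.default_rng(0)
n = 500

# Same true competency distribution in both schools.
true_a = rng.normal(loc=0.0, scale=1.0, size=n)
true_b = rng.normal(loc=0.0, scale=1.0, size=n)

# Hypothetical peer-achievement norms students compare themselves against.
peer_norm_a = 0.0   # typical-achieving peers
peer_norm_b = 0.8   # higher-achieving peers

def self_report(true_skill, peer_norm, anchoring=0.5, noise_sd=0.3):
    """Self-rating = true skill judged relative to local peer norms, plus response noise."""
    return true_skill - anchoring * peer_norm + rng.normal(0, noise_sd, size=true_skill.shape)

report_a = self_report(true_a, peer_norm_a)
report_b = self_report(true_b, peer_norm_b)

print(f"True means:        A={true_a.mean():+.2f}  B={true_b.mean():+.2f}")
print(f"Self-report means: A={report_a.mean():+.2f}  B={report_b.mean():+.2f}")
# Despite identical true competencies, School B's mean self-report comes out lower,
# so a between-school comparison would mistake the norm difference for a skill difference.
```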

More broadly, regardless of assessment format, reliability and validity vary across contexts and purposes. The importance of connecting validity and reliability to purpose is well documented. A measure that works in one setting may behave differently in another, depending on student population, administration conditions, or how the instrument is introduced. 

The connection between durable competencies and academic domains presents a particular challenge. Competencies like collaboration and self-management can look different across subject areas, and a student might demonstrate strong competencies in science class but not in English class. Durable competencies are tightly intertwined with academic content, which complicates efforts to generalize across content areas.

Related: See our toolkit, Assessing 21st Century Skills, for a collection of blogs and literature reviews.

Differences in student development add another layer of complexity. As students progress through school, their cognitive capacities and social contexts expand. Measures, and the definitions that underlie them, need to reflect these changes. Learning progressions, which describe how competencies develop over time, can support the design of developmentally appropriate measures. However, relatively few empirically validated learning progressions currently exist for many durable competencies, although recent initiatives are beginning to address this gap.

Even when measures are designed with development in mind, evidence on sensitivity to change remains uneven. In principle, tracking changes in competencies over time could inform instruction and program design. In practice, changes in scores may reflect not only real development, but also shifts in how students interpret items or differences in the contexts in which they respond. Distinguishing among these possibilities is not always straightforward.

Finally, advances in technology introduce both new possibilities and new uncertainties. Researchers are increasingly using large language models and other AI-enabled tools to support assessment design, scoring, and reporting. These approaches have the potential to improve efficiency and expand the range of tasks that can be used to elicit evidence of durable competencies. At the same time, evidence of their validity, reliability, and fairness is still emerging, particularly across diverse social and cultural contexts.

These considerations help explain why experts continue to caution against using these measures for high-stakes decisions that directly affect students. Although many states and districts have adopted Portrait of a Graduate frameworks or similar models, large-scale measurement, especially when tied to consequences for students or educators, remains limited.

What Does Responsible Durable Competency Measurement Look Like? 

For state and local education agency leaders, the question is not simply whether to measure these competencies, but how to align measurement with purpose.

One place to start is with clarity about purpose. Measures designed to support local improvement often prioritize interpretability and relevance to educators and students. In contrast, measures intended for research, evaluation, or accountability typically emphasize consistency and, in some cases, sensitivity to change. Attempting to meet all of these goals simultaneously can introduce tension, underscoring the importance of aligning measurement design with intended use.

Questions of comparison raise related challenges. Given the potential for reference bias and contextual variation, comparisons across schools or student groups are not always as informative as they appear. In some cases, patterns within a school over time, or variation across items within a single administration, may provide more useful signals. Even then, these uses depend on evidence of validity, reliability, and fairness.

Interpretation presents another layer of complexity. Results from durable competency measures are often reported as numeric scores or scale points that convey a sense of precision. That precision can be misleading if it obscures underlying uncertainty. Treating results as indicators that prompt further inquiry, rather than definitive judgments, may better reflect what the data can support.
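One way to make this concrete is the classical test-theory standard error of measurement, which translates a scale’s reliability into an uncertainty band around an observed score. The sketch below assumes a 1-5 self-report scale, a score standard deviation, and a reliability coefficient chosen purely for illustration.

```python
# A minimal sketch of why an apparently precise score can carry real uncertainty,
# using the classical test-theory standard error of measurement:
#   SEM = SD * sqrt(1 - reliability)
# The scale SD, reliability, and score below are illustrative assumptions, not real data.
import math

scale_sd = 0.9        # assumed SD of student scores on a 1-5 self-report scale
reliability = 0.75    # assumed reliability coefficient for the scale

sem = scale_sd * math.sqrt(1 - reliability)   # ~0.45 scale points
band_95 = 1.96 * sem                          # approximate 95% band around an observed score

student_score = 3.6
print(f"SEM = {sem:.2f} scale points")
print(f"Observed score {student_score:.1f} is consistent with true scores "
      f"roughly between {student_score - band_95:.1f} and {student_score + band_95:.1f}")
# Under these assumptions, a reported 3.6 is hard to distinguish from a 3.2 or a 4.0,
# which is why small score differences are better treated as prompts for inquiry
# than as definitive judgments.
```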

It is also useful to consider how these measures fit within a broader body of evidence. Like most assessments, durable competency measures are typically more informative when used alongside other sources of information, such as classroom observations, student work, or school climate data, rather than as stand-alone indicators.

In many settings, the most productive use of these data is to surface questions rather than provide answers. Results can prompt conversations about instruction, student experience, family engagement, and system design. Used in this way, measurement becomes a starting point for inquiry rather than an endpoint for decision-making. Rather than driving high-stakes decisions about outcomes, these measures can help identify where additional supports are needed.

Reframing the Role of Measurement

As interest in durable competencies continues to grow, it is easy to ask more of measurement than the current evidence can support. A more productive stance may be to right-size our expectations.

This is not an argument against measurement, or against the importance of these competencies. It is an argument for alignment between what we want to know, how we propose to measure it, and how we intend to use the results. Achieving this alignment often involves navigating tradeoffs: between comparability and actionability, precision and interpretability, and broad system use and local relevance. 

Approached with clarity about these tradeoffs, durable competency measures can play a constructive role by informing conversations, supporting local improvement efforts, and contributing to a more comprehensive understanding of student learning.

Photo by Allison Shelley/The Verbatim Agency for EDUimages
