How Do We Improve Interim Assessment?
Reflections From the 2019 Reidy Interactive Lecture Series (RILS) Conference
In the seacoast region of New Hampshire, we are enjoying the kind of crisp early autumn temps that might call for a light sweater, and the foliage reveals just a hint of the color that draws ‘leaf peepers’ to the region each year. But it wasn’t just the postcard-perfect scene that drew more than 80 education and assessment leaders from around the country to Portsmouth on September 26-27, 2019. The Center’s annual Reidy Interactive Lecture Series (RILS) offered an opportunity for those assembled to learn and contribute ideas around an important topic: how can we improve interim assessments? More specifically, what resources and strategies will help us select, use, and evaluate interim assessments more effectively?
Straightway, I want to credit my colleagues Juan D’Brot and Erika Landl, who did a masterful job organizing the conference. Not only did they coordinate a series of presentations and discussions, each led by superb experts and practitioners too numerous to mention individually (but, seriously, thank you to ALL our contributors!), but they also developed and distributed an initial version of an interim assessment toolkit at the conference. The toolkit is designed to help leaders identify high-priority needs and uses, then select and evaluate interim assessments with respect to these purposes. The toolkit will benefit from ongoing refinement and development, and some suggestions for improvement were elicited at RILS. However, consistent with the Center’s open access ethic, a beta version is fully and freely available now on our website. Check it out; we welcome your feedback.
But this post is about more than raving about the beautiful fall weather or even bragging about the good work of my colleagues. I’m also writing to share some reflections about what I learned from all the participants at RILS, some of which I shared in my closing remarks at the event. I’ve organized my takeaways as three “Dos” and three “Don’ts.”
To be clear, these are my reflections inspired by both the conversations at RILS and my experiences; these are NOT any ‘official records’ of outcomes from RILS. Any limitations or shortcomings in these ideas are mine alone.
Do prioritize practices over products.
When we think about the promise of effective interim assessment, we should think first about the practices we want to encourage. For example:
- Do we want to signal and stimulate deeper learning? Then let’s talk about interim assessments that are designed to assess higher-order thinking skills, such as through rich performance tasks.
- Do we want to pinpoint learning needs for students who are currently not meeting performance expectations? Then let’s talk about frequent, focused assessments representing a credible learning progression at a smaller grain size in order to provide useful instructional insights.
The point is, our conversations should start with a recognition of the types of purposes and practices we value – not what products we think we should adopt. I believe Erika and Juan’s interim assessment toolkit helps us do just that.
While we’re at it, let’s have less fighting over terms (e.g., formative, interim, benchmark, summative, diagnostic, screener) and more focus on describing assessments in terms of their uses. Assessment developers and experts love to create new terminology and police the usage of extant terminology to the point that practitioners can be paralyzed. Sure, clear terms are needed for efficient communication. But, instead of arguing over whether a test is, say, interim or summative, let’s talk about how the test is supposed to be used.
Do look for ways to make incremental progress.
It’s exciting to watch a baseball player hit a game-changing home run, but baseball teams that hit a lot of singles are also successful. Besides, home run hitters tend to strike out a lot.
Similarly, if we’re reluctant to pursue improvements in assessment because conditions seem too limiting to launch a game-changing reform, we risk continuing to support a substandard status quo. There are state, district, and school leaders making positive changes now to improve interim assessment practices. That’s not nothing.
For example, we can worry that commercial interim assessment products are far from ideal and dream of a world where they look very different. Or, as some leading states are doing, we can develop criteria for high-quality assessments and evaluate existing products against these criteria. By so doing, we reward superior solutions and incent ongoing, incremental improvement.
As another example, we can lament that summative assessments are too limited to reflect the kind of learning that matters most. But some forward-thinking districts are making incremental progress with assessment initiatives, such as performance tasks or exhibitions. These initiatives seek to incentivize and measure skills that go beyond what is typically covered on state accountability tests. These initiatives may not be perfect, but they are incrementally improving practice.
Do focus on capacity building.
Improving interim assessment is not a problem that measurement wonks alone can solve. Moreover, while efforts to systematize the process through resources like the toolkit are helpful, they are almost certainly not singularly sufficient. Even the most accessible initiatives require a baseline of expertise to deliver on their promise.
We need sustained, scalable initiatives to build enough capacity to improve practices for understanding and using assessments well. Capacity-building likely looks different for various groups (e.g., policymakers, district leaders, classroom educators). There’s much more that can be written about this topic, but for this post, it is sufficient to say investments in capacity-building are critical to the success of any effort to improve the selection, use, and evaluation of assessments.
Don’t overpromise.
Some, myself included, have argued that the limitations of commercial assessment products are not the chief problem; rather, it’s the outsized claims that have been made about these products. Indeed, one could be forgiven for wondering if the marketing and psychometric teams within some companies regularly communicate. Even a quick scan of the websites or brochures produced in support of commercial assessment products often reveals a long list of supported uses and interpretations from a single test (e.g., provides diagnostic information, informs instruction, predicts summative test performance, measures academic growth). Unfortunately, if we ask a test to do everything well, it probably does nothing very well.
But in my experience, there are some impressive accomplishments and ongoing promising work being done by our friends in the commercial assessment industry. For example, commercial developers have pioneered very efficient adaptive models, have produced very user-friendly static and dynamic reports, and have created some innovative item types. Given the vision, talent, and resources of these companies, I think we can expect more innovation to come.
In short, I urge these companies to be very clear about what claims are and are not supported for their products, under what conditions the intended uses and interpretations are credible, and what evidence exists in support of these interpretations. I argue that overpromising is more than ‘puffery’ that can be shrugged away as contemporary marketing. It damages the credibility of the field and, most importantly, it inhibits good educational practice.
Don’t be constrained by summative assessment.
I’m skeptical that making interim assessments look like ‘mini-summative’ tests is a promising path forward.
The summative assessment is typically intended to be useful for generalizing about student achievement with respect to a broad range of standards. However, I see more promise in interim solutions that are less about generalizability and more about richer insights into student achievement within a specific context. For example, if one wants a greater understanding of how well a student can analyze text or make and support an argument, what’s more likely to be helpful to educators – a scale score or a sample of student work?
One might argue that we can do both, but there is a danger in ‘over-engineering’ interim assessments by introducing technical and administrative constraints that work against the instructional utility of the test. If we relax some of the psychometric and administrative requirements more appropriate for end-of-year accountability tests, we make available a larger ‘sandbox’ to innovate.
Don’t separate curriculum and instruction from assessment.
This point is not original or even particularly complex. But I worry it is often overlooked. I’ll leave it at that.
I’ll close with one additional ‘don’t.’ Don’t falter in the commitment to improved outcomes for students. I was so encouraged at RILS to witness the clear focus on student learning at the core of almost every conversation.
While participants shared a variety of ideas, insights, and initiatives, I noted an explicit grounding in a larger view (whether we call it a theory of action, a logic model, or something else) for how we can and should leverage practice to improve student achievement. That focus gives me optimism about the path forward.