
Evaluating Generative AI Feedback in Classroom Assessment: A Meta-Synthesis
Does GenAI provide students with effective feedback?
This is the first in a series of posts by our 2025 summer interns, based on the projects they designed with their Center mentors. Elie ChingYen Yu, a doctoral student at the University at Albany-State University of New York, worked with Associate Director Carla Evans.
Generative Artificial Intelligence (GenAI) tools are increasingly used to provide feedback in classrooms, yet their adoption has outpaced scrutiny of the feedback they produce. As part of my internship, I designed a study that synthesizes 33 research reviews so we could examine how GenAI feedback is characterized and whether it aligns with research-based principles of effective feedback.
Based on my study, Dr. Evans and I created a set of 10 criteria that can serve as a practical tool to help educators evaluate GenAI tools for the quality of their feedback and user interface before using them.
According to UNESCO, GenAI is “an artificial intelligence (AI) technology that automatically generates content in response to prompts written in natural-language conversational interfaces … GenAI actually produces new content.”
Since the release of ChatGPT in late 2022, interest in GenAI in education has surged. Numerous research reviews highlight its potential to deliver timely, personalized feedback to students about the quality of their thinking and work products. However, few researchers examine the nature or quality of this “new content” when it is intended as feedback to students. The study I designed addresses that gap by analyzing how GenAI feedback is characterized in recent reviews and evaluating its alignment with principles of effective feedback.
Feedback Is a Powerful Tool for Moving Students’ Learning Forward
Decades of research have shown that high-quality feedback tells students where they are in relation to a task goal. It is most effective when delivered under the right conditions: with the right level of detail (task-oriented specificity), at the right time (during learning), to the right person (students receptive to feedback), and with the right intention (supportive and respectful). Yet it remains unclear if GenAI feedback aligns with those principles. The goal of my internship this summer was to synthesize the reviews published between 2022 and 2025 to address the following research questions:
- How is GenAI feedback defined and characterized in review studies on classroom assessment? To what extent does it reflect principles of effective feedback?
- What themes emerged in the review related to GenAI feedback?
How We Conducted Our Study
Working with Associate Director Carla Evans, I followed Preferred Reporting Items for Systematic Reviews and Meta-Analyses guidelines, applying GenAI and classroom assessment search terms across three academic databases and Google Scholar. Studies did not have to be peer-reviewed, reflecting the rapidly evolving nature of GenAI research and reporting in education.
We focused on synthesizing review articles. We saw a surge of reviews published within two years of ChatGPT’s release, some cited over 3,000 times (as of July 18, 2025), suggesting that these studies shape early understanding and adoption of GenAI in education. These reviews reveal not only the scope of evidence but also how GenAI is being framed and communicated in the research community.
I coded eligible studies using Ruiz-Primo and Brookhart’s (2018) framework, focusing on three feedback dimensions: context, content, and attributes.
What We Found: Many Problems With Feedback to Students
I analyzed 33 research reviews—mostly containing studies from higher education. ChatGPT was the most frequently referenced GenAI tool, though some reviews discussed BingChat and Claude.
Research Question 1
The research syntheses I reviewed described GenAI feedback with various buzzwords, including “immediate,” “individualized,” “personalized/adaptive,” “comprehensive/detailed,” “objective,” and “high-quality.” Yet these terms were rarely defined or illustrated. Most characterizations emphasized context (e.g., immediacy, accessibility) and attributes (e.g., personalization, objectivity), with little discussion of instructional appropriateness.
For instance, reviews highlighted immediate feedback but did not consider whether such timing suited the learner or task complexity. Descriptors like “personalized” were invoked without clarifying what was tailored to the individual or how. Often, “personalized” simply refers to GenAI’s real-time responses, not adaptation to prior knowledge, goals, or context. The content of the feedback, such as its focus and reference, was largely unaddressed.
Research Question 2
To move beyond surface-level descriptors, I conducted a thematic synthesis of recurring patterns across reviews, supplemented with insights from selected primary studies cited within those reviews. I highlight a few themes below.
Hallucination and Misinterpretation
Accuracy and reliability—often assumed in feedback literature—cannot be presumed with GenAI, whose outputs are based on probabilistic prediction. Central to these concerns is the phenomenon of hallucination, where, in the context of feedback, GenAI misinterprets student responses, applies nonexistent evaluation criteria, or invents structures not present in the student’s work.
The trustworthiness of GenAI feedback is a significant concern. Students may not be able to critically evaluate the accuracy of the feedback, and teachers who lack the time to review the feedback provided to every student in their classes may end up blindly trusting its quality.
Comparison With Human and Rule-Based Feedback
Researchers have observed significant differences in feedback quality when comparing GenAI feedback to feedback from human raters or rule-based systems. Compared to trained raters or rule-based tools like Grammarly, ChatGPT’s feedback often lacks prioritization (what matters most), clear directions for improvement, structure, and a supportive tone, frequently overwhelming students with excessive or vague comments.
Equity Concerns for Vulnerable Student Populations
Several studies raised concerns about using GenAI tools for feedback with vulnerable student populations, particularly those with lower language proficiency, less prior knowledge, weaker metacognitive skills, or limited skills in regulating their learning.
Studies showed that GenAI feedback is often too complex or too generic for lower-intermediate language learners to understand and apply effectively, especially when it lacks cultural or linguistic adaptation. Students with lower metacognitive awareness are more likely to adopt GenAI feedback passively, rather than learning from the feedback provided.
Cautionary Notes and Remaining Questions
Empty Claims Undermine Feedback Principles
Our review revealed that while the context and attributes of GenAI feedback are frequently emphasized, they are rarely defined or justified. The content of feedback is often neglected in favor of vague descriptors such as “personalized” and “adaptive.”
Effective feedback prioritizes critical areas for students to improve, whereas GenAI feedback tends to comment on everything, which can overwhelm learners, especially those with limited prior knowledge or self-regulatory skills. Yet most reviews overlooked an appropriate balance of amount, focus, and learner readiness, obscuring the nuanced interplay of the content, context, and attributes that define effective feedback.
Accuracy Is Foundational, but Shouldn’t Be Assumed
This review reveals that a key assumption in the feedback literature—accuracy—is taken for granted when applied to GenAI feedback. Frameworks like Ruiz-Primo and Brookhart’s discuss qualities of effective feedback but do not explicitly treat accuracy as a separate component, given that the source of feedback is often the teacher, whose accuracy is typically assumed. However, with GenAI, that assumption no longer holds, since it is built upon prediction, not understanding. Therefore, when using GenAI to provide feedback, accuracy must be foregrounded; otherwise, evaluating effectiveness is premature.
No Prompt Can Make GenAI Understand
Some argue that GenAI feedback flaws can be corrected through better prompting; however, recent evidence shows that GenAI often prioritizes user satisfaction over truth, even after model fine-tuning or advanced prompting strategies. The user feels satisfied with the answer because it sounds good, but the feedback may be misaligned or incorrect.
Feedback Is Consequential
Feedback is consequential, especially when used for instructional decisions. For example, students may receive incorrect feedback or not receive feedback when they should. To reduce harm, we recommend keeping humans in the loop—not as an abstract principle, but through concrete tools for evaluating the quality of GenAI feedback before purchasing or using a GenAI tool.
Click here to use the practical evaluation tool we created to support educators in evaluating any GenAI tool for the quality of its feedback and user interface before purchasing or using it in their classrooms, schools, or districts.
Photo by Allison Shelley for EDUimages
