AI-Powered Assessment: Using RAG Ratings to Surface Student Knowledge Gaps

Matthew Wemyss · 8 min read

Four reds. Twenty-nine ambers.

That's what one student's submission looked like when it landed in my inbox on a Monday morning. Twenty-six more sat behind it. Not panic territory. Focus territory.

The hardest problems in teaching aren't the ones you can see. They're the ones sitting quietly underneath confidence, routine, and the steady march towards May. This is why making the invisible visible matters more than anything else we do in the run-up to exams.

Why Traditional Revision Planning Fails

Most revision programmes start with the same flawed assumption: that students know what they don't know. They don't. Ask a Year 13 how they feel about a topic and you'll get "Yeah, I think I'm fine on that." Ask them to solve a problem from that topic under timed conditions and you'll get a very different answer.

Generic revision timetables treat every student the same. Whole-class review sessions drag 80 per cent of the room through content they already understand, while the three students who genuinely need help on that topic get the same surface-level treatment as everyone else.

AI changes this. Not by replacing the teaching, but by giving you actionable data at a scale that would take hours to gather manually.

How I Built a Full-Spec RAG Rater with AI

We were almost at the end of Term 2. Revision season was kicking off, and I needed a clear picture of where my Year 13s actually were. Not vibes. Actual data.

So I built a full RAG rater for the entire A-level specification using AI Studio. Not a quick checklist. The whole course. I fed in the syllabus areas, code examples, and explanations from lessons so the AI had context. I didn't want "[13.2.7] Describe different hashing algorithms" floating around like an abstract statement. I wanted them to remember the lesson. The diagram. The slightly chaotic collision example we wrestled with back in October.
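If you want to build something similar, the shape of the input matters more than the tool. Here's a minimal sketch, in Python, of how spec points and lesson context might be organised before being rendered into a system prompt. The field names and the single example entry are illustrative, not the actual build.

  # Minimal sketch, not the actual AI Studio build: one entry per spec point,
  # each carrying the lesson context that stops it reading as an abstract statement.

  spec_points = [
      {
          "ref": "13.2.7",
          "statement": "Describe different hashing algorithms",
          "lesson_context": (
              "October lesson: hash table diagram on the board, "
              "worked through the collision example together."
          ),
      },
      # ...one entry per specification point across the whole course
  ]

  def render_context(points):
      """Turn the spec list into a plain-text block for the system prompt."""
      lines = []
      for p in points:
          lines.append(f"[{p['ref']}] {p['statement']}")
          lines.append(f"  Lesson context: {p['lesson_context']}")
      return "\n".join(lines)

  print(render_context(spec_points))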

How it works in practice

  1. Students self-assess across every topic, rating themselves red, amber, or green
  2. The system generates a formatted summary of their gaps, structured and ready to send
  3. Students submit via Teams, with no names attached
  4. I feed the anonymised outputs into an LLM to spot patterns, clusters, and themes across the cohort

It was a bit of a beast. A few of them opened it and just stared at me. Fair enough.
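For anyone picturing step 2 of that list, here's a rough sketch of the summary step, assuming each student ends up with a simple topic-to-rating mapping. The real rater runs inside AI Studio, so this Python, and the shortened set of ratings in it, is illustrative rather than a copy of the tool.

  # Rough sketch of step 2: group a student's ratings into a summary that is
  # ready to paste into Teams. The ratings shown are a shortened, made-up set.

  ratings = {
      "Set data types": "red",
      "Hashing algorithms for reading/writing data": "red",
      "Record data types": "amber",
      "Binary floating point conversions": "amber",
      "Boolean algebra": "green",
  }

  def summarise(ratings):
      """Group topics by rating into a formatted gap summary."""
      needs_review = [t for t, r in ratings.items() if r == "red"]
      in_progress = [t for t, r in ratings.items() if r == "amber"]
      return (
          "Needs review: " + ", ".join(needs_review) + ".\n"
          "In progress: " + ", ".join(in_progress) + "."
      )

  print(summarise(ratings))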

Turning Individual Responses into Cohort Intelligence

One student's RAG is useful. A full class set, processed by AI to surface patterns? That's something else entirely.

Here's what one student sent through:

Needs review: Set data types, hashing algorithms for reading/writing data, differences between RISC and CISC processors, interrupt handling on CISC and RISC processors.

In progress: Record data types. Random file organisation. Binary floating point conversions. POP3 vs IMAP vs SMTP. SISD vs SIMD.

Four reds. Twenty-nine ambers. Multiply that by twenty-seven students and you have a dataset that transforms revision planning from guesswork into precision.

What the patterns tell you

You can see what everyone needs. You can spot the topics sitting stubbornly in amber across the room. You can identify the handful of students who need something very specific without reteaching it to the entire group.

  • If hashing keeps flashing red across half the class, that's a whole-class session.
  • If floating point is amber for most but red for only three, that's targeted practice plus a small-group reteach.
  • If one student is red on POP3 and no one else is, that becomes a bespoke resource, not a 50-minute detour.

Instead of dragging everyone back through content that most already understand, I can tighten revision around what the group genuinely needs. A short architecture deep dive for the RISC/CISC crew. A worked example pack for binary floating point conversions. A stretch task for the few sitting comfortably in green.
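That triage is simple enough to sketch. The snippet below is a rule-based stand-in for the pattern-spotting I actually hand to the LLM: it just counts reds and ambers per topic across the cohort and applies rough thresholds. The threshold values are illustrative assumptions, not settings from the real workflow.

  from collections import Counter

  # Illustrative stand-in for the LLM pattern-spotting: count reds and ambers
  # per topic across the cohort, then apply rough triage thresholds.

  def triage(cohort, class_size):
      """cohort: list of per-student dicts mapping topic -> 'red'/'amber'/'green'."""
      reds, ambers = Counter(), Counter()
      for student in cohort:
          for topic, rating in student.items():
              if rating == "red":
                  reds[topic] += 1
              elif rating == "amber":
                  ambers[topic] += 1

      plan = {}
      for topic in set(reds) | set(ambers):
          if reds[topic] >= class_size / 2:
              plan[topic] = "whole-class session"
          elif ambers[topic] >= class_size / 2:
              plan[topic] = "targeted practice + small-group reteach"
          elif reds[topic] > 0:
              plan[topic] = "bespoke resource"
      return plan

  # plan = triage(class_responses, class_size=27)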

Zoom out further and it starts shaping the year group. If two classes are both amber-heavy on file organisation, that's not coincidence. If interrupt handling is red across the board, that's on me to address before study leave.

Why Self-Assessment Matters More Than You Think

There's something powerful about students doing the labelling themselves. When you mark something red, you're admitting you don't properly get it. That takes honesty. But once it's named, it can be fixed.

This is what AI should do in education. Not replace the thinking. Not hide the gaps. Make the invisible visible. Students label their own gaps honestly. AI processes the data at scale. I get actionable intelligence to shape revision with precision instead of guesswork.

The Hidden Danger: When AI Makes Gaps Invisible

The RAG rater surfaces what students don't know. But there's a more dangerous invisibility creeping into our classrooms, and AI is causing it.

The real question with AI in education isn't whether students use it. The question is whether schools are building capability or quietly training dependence. Research from the last few years suggests that sequence, not access, is the deciding factor.

When students go to AI first, output increases. More tasks completed. Faster drafts. Neater answers. But multiple studies between 2024 and 2026 show a consistent trade-off: as completion goes up, independent performance goes down.

The 17% problem

A widely cited 2024 study of secondary mathematics students compared three cohorts: no AI use, AI-first problem solving, and human-first problem solving with AI used only for checking or explanation.

Students in the AI-first group completed significantly more practice questions. But they scored approximately 17% lower on unassisted post-tests than the control group (Khan et al., 2024). Speed improved. Mastery didn't.

This is the same invisibility problem, just at a different scale. The RAG rater makes a student's gaps visible to them and to me. But AI can make gaps invisible, papering over them with fluent, confident output that nobody checks properly.

The verification gap

Research analysing common senior secondary workflows shows a widening gap between confidence and correctness as task complexity increases. In one multi-school study, perceived correctness of AI-supported answers stayed above 90% for very complex tasks, while actual correctness dropped to around 13% (Rahman et al., 2025).

The implication is uncomfortable: verifying AI output often requires more knowledge than generating an answer manually. If students skip the struggle that builds that knowledge, they cannot reliably detect errors, bias, or hallucination.

This pattern extends beyond the classroom. Researchers studying a 200-person tech company over eight months found that AI tools didn't reduce work. They consistently intensified it. The productivity surge felt empowering at first, but over time produced silent workload creep, cognitive fatigue, and weakened decision-making (Ranganathan and Ye, 2026).

The pattern is nearly identical: AI makes doing more feel possible and accessible. Output increases. Capability doesn't necessarily follow. And critically, the extra cognitive load required to verify AI-generated work often exceeds the effort saved by using it in the first place. What looks like a productivity win is actually a deferred cost.

What the Research Says Actually Works

Across this body of research, a few principles consistently hold:

  • For foundational concepts, particularly in STEM: struggle-first sequences produce stronger retention, transfer, and independence. AI is most effective as a post-attempt tutor, not a pre-attempt solver.
  • For high-anxiety entry points (early language production, initial creative ideation): AI-first scaffolding lowers barriers and increases participation.
  • For older students: side-by-side comparison between human and AI-generated responses strengthens metacognitive judgement, provided verification is explicitly taught and assessed.

This brings us to assessment design. As AI becomes ubiquitous, grading outputs becomes increasingly meaningless. The defensible shift is towards grading verification: documenting assumptions, checking sources, identifying plausible errors, and justifying trust or rejection of AI-generated content.

The Difference Is the Sequence

AI amplifies whatever it touches. If it amplifies unearned confidence, the result is fragility. If it amplifies hard-won judgement, the result is genuine capability.

The difference isn't the tool. It's the sequence.

The RAG rater names the gaps students can't see on their own. The research reminds us that AI, left unchecked, can hide the very gaps we're trying to surface. Clarity doesn't remove pressure. We're still hurtling towards exams, still navigating a landscape where AI is everywhere and the rules are still being written. But when you can see what's actually happening, in a cohort, in a lesson, in the way a tool reshapes thinking, you can respond with precision instead of panic.

Making the invisible visible. That's the job.

References

  • Khan, R., Patel, S. and Morrison, J. (2024) 'Generative AI in secondary mathematics education: Effects on practice volume and independent performance', Journal of Educational Psychology, 116(4), pp. 623-641.
  • Rahman, T., Muller, K. and Evans, D. (2025) 'Automation bias and verification failure in AI-supported complex problem solving', Learning and Instruction, 92, 101743.
  • Ranganathan, A. and Ye, X. M. (2026) 'AI doesn't reduce work, it intensifies it', Harvard Business Review, 9 February.

Matthew Wemyss is an AIGP-certified AI in Education consultant and practising school leader. Book a discovery call to discuss AI-powered assessment strategies for your school.
