AI Assisted Grading for STEM?
Practical tips on what to do, how to do it, and what to look out for.
Grading physics and chemistry is not the same as grading a short reflection or discussion post. Students show work in different ways. They make small math mistakes that carry through the rest of the problem. They use equivalent formulas. They draw diagrams. They submit handwritten work. All of this means AI can be useful, but it needs the right structure around it.
1. Build rubrics AI can actually use
Vague criteria descriptors are the #1 reason AI grading can be inconsistent. "Shows conceptual understanding" means different things to different graders, and even to the same model across different runs. Atomic, point-bearing criteria with concrete descriptions are the fix. Compare:
Vague: "Demonstrates correct application of Newton's laws"
Atomic: "Identifies all forces acting on the object" / "Draws forces from correct points of application" / "Applies F=ma with consistent sign convention"
A few practices that pay off:
List acceptable equivalent forms explicitly
Decide upfront what's point-bearing (sig figs, units, notation) and break those out as separate criteria so they don't contaminate the rest of the rubric
Build a small library of the 5-10 most common student errors per problem type. Feeding these to the model as context dramatically improves consistency, because it recognizes patterns rather than reasoning from scratch each time (see the sketch below).
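To make this concrete, here's a minimal sketch in Python of how atomic criteria and a common-error library can be packaged as structured context for a grading model. All criterion text, point values, and error descriptions are hypothetical examples, not a prescribed schema:

```python
# Sketch: an atomic, point-bearing rubric plus a small library of common
# errors, rendered into the context sent alongside student work.
# All criterion text and error descriptions are hypothetical examples.
from dataclasses import dataclass

@dataclass
class Criterion:
    points: float
    description: str             # atomic and observable: a yes/no check
    acceptable_forms: list[str]  # equivalent forms listed explicitly

rubric = [
    Criterion(1, "Identifies all forces acting on the object",
              ["weight/gravity", "normal force", "friction", "applied tension"]),
    Criterion(1, "Draws forces from correct points of application", []),
    Criterion(2, "Applies F = ma with a consistent sign convention",
              ["F_net = ma", "sum of forces = ma"]),
    # Point-bearing mechanics kept separate so they don't bleed into
    # the physics criteria:
    Criterion(0.5, "Reports the final answer with correct units", ["N", "newtons"]),
]

# The most common errors for this problem type, fed in as context
common_errors = [
    "Omits the normal force on an inclined surface",
    "Mixes sign conventions between the x- and y-equations",
    "Uses g = 10 m/s^2 when the problem specifies 9.8 m/s^2",
]

def build_grading_context(rubric: list[Criterion], errors: list[str]) -> str:
    """Render the rubric and error library into the grading prompt."""
    lines = ["Score each criterion independently:"]
    for i, c in enumerate(rubric, 1):
        forms = f" Accept: {', '.join(c.acceptable_forms)}." if c.acceptable_forms else ""
        lines.append(f"{i}. ({c.points} pts) {c.description}.{forms}")
    lines.append("Watch for these common errors: " + "; ".join(errors))
    return "\n".join(lines)
```

The design choice that matters here is that each criterion is independently checkable: the model answers a series of small yes/no questions instead of making one holistic judgment.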
2. Think multi-dimensional: one rubric for students, another for the AI
Students benefit from rubrics they can read and use to understand expectations. But AI benefits from much more granular detail. If you've spoken with the team here, you've probably heard us say, "don't make the rubric so detailed that it becomes paint-by-numbers." That advice still holds for what students see, which is exactly why the two audiences need different views of the same rubric.
You can think about this as a rubric that has multiple faces.
The student-facing rubric stays clear and pedagogical; its goal is to remove ambiguity. For example:
5 points:
Identifies all forces acting on the object
Draws forces from correct points of application
Applies F=ma with consistent sign convention
The AI/instructor-facing grading rubric gets specific (a sketch follows these examples):
For "identifies all forces," that might mean: the specific forces expected for this problem (weight, normal, friction, applied tension), how to score when one is missing but downstream work is correct (ECF rules), and common omissions to flag (e.g., the normal force on inclines).
For "consistent sign convention," it specifies that any choice of positive direction is acceptable as long as the student stays consistent—and exactly what to do if they don't.
Want an analogy? Think of a Rubik's (rubric's?) cube. Whichever way you look at it, there's always a side you can't see. That's a multi-dimensional rubric: some faces are visible to students, and some are not.
3. Handwriting recognition isn't the differentiator it used to be
When working with STEM instructors, we often hear the question, "Can it read handwriting and formulae?"
Optical character recognition (OCR) used to be a novel and meaningful feature in AI grading tools such as Gradescope. It isn't anymore: any frontier multimodal model handles handwritten student work reasonably well out of the box.
That said, it's still prone to mistakes. A "2" that looks like a "z," a "5" that reads as an "S," a hastily written "θ" that gets mistaken for a "0" - these are genuinely ambiguous, even for a human staring at the same page. The real question isn't whether your tool can do OCR; it's how it handles the cases where the input itself is unclear.
A few things help:
Set formatting expectations with students: box final answers, write units explicitly, prefer stacked fractions over slashed ones.
Pre-process for legibility: good contrast, straight orientation, one problem per page where possible (a sketch follows this list).
Maintain a human in the loop: double-check each student submission for OCR errors.
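Here's a sketch of that pre-processing step using Pillow. The contrast factor is an illustrative default, not a tuned value:

```python
# Sketch: basic legibility pre-processing for scanned handwritten work.
# Uses Pillow; the contrast factor of 1.8 is an illustrative default.
from PIL import Image, ImageEnhance, ImageOps

def preprocess_scan(path: str, out_path: str) -> None:
    img = Image.open(path)
    img = ImageOps.exif_transpose(img)             # honor phone-camera orientation tags
    img = ImageOps.grayscale(img)                  # drop color noise from photos
    img = ImageEnhance.Contrast(img).enhance(1.8)  # boost faint pencil strokes
    img = ImageOps.autocontrast(img)               # normalize levels across scans
    img.save(out_path)

# preprocess_scan("submission_p1.jpg", "submission_p1_clean.png")
```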
For chemistry mechanisms, Lewis structures, and physics free-body diagrams, current multimodal models give a useful first pass but aren't yet reliable enough to auto-grade, which is why we do not advocate fully automated grading.
4. Make Error Carry Forward (ECF) explicit
ECF is the principle that students shouldn't be penalized twice for one mistake. If a student calculates the wrong acceleration but then correctly uses that value to find the final velocity, they should earn the points for the velocity calculation.
To get AI to do this reliably, instruct it to:
Identify the first error and its location in the work.
Re-analyze what each subsequent step should equal given that wrong intermediate value.
Score each step against the ECF-adjusted expectation, not the true answer.
Use symbolic computation (SymPy, Wolfram, a code-execution step) for the actual math, as in the sketch below. LLM arithmetic still slips on long calculations, and that's where ECF judgments most often go wrong.
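Here's a minimal sketch of that ECF check with SymPy. The problem values and the student's wrong intermediate are hypothetical; the point is that the expected final answer is recomputed from the student's own value rather than the true one:

```python
# Sketch: ECF-adjusted checking with SymPy. A student computes acceleration
# incorrectly, then uses it correctly in v = v0 + a*t. We score the velocity
# step against the student's own (wrong) acceleration, not the true value.
# All numbers here are a hypothetical example.
import sympy as sp

v0, a, t = sp.symbols("v0 a t")
velocity_expr = v0 + a * t         # the formula this step should apply

true_a = sp.Rational(49, 10)       # correct acceleration: 4.9 m/s^2
student_a = sp.Rational(98, 10)    # student's wrong intermediate: 9.8 m/s^2
student_v = sp.Rational(226, 10)   # student's final answer: 22.6 m/s

subs = {v0: 3, t: 2}

# ECF-adjusted expectation: plug in the student's wrong acceleration.
expected_with_ecf = velocity_expr.subs({**subs, a: student_a})  # 22.6
expected_true = velocity_expr.subs({**subs, a: true_a})         # 12.8

# Award the velocity-step points if the student's answer matches the
# ECF-adjusted expectation, even though it differs from the true answer.
earns_step_credit = sp.simplify(student_v - expected_with_ecf) == 0
print(earns_step_credit)  # True: correct method applied to a wrong input
```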
ECF matters most in multi-step problems where one number flows through several steps (kinematics chains, force problems with sign errors, energy conservation, momentum problems). Specify in your AI/instructor grading rubric which criteria support ECF and which are absolute (a free-body diagram missing a force is wrong regardless of what's done with the equations afterward).
5. Keep humans in the loop, always
AI grading will get things wrong, no matter how accurate the model is. The most important question is: what happens when it does? Plan for the different types of errors:
Hallucinated errors: claiming a student made a mistake they didn't
Missed equivalents: marking a correctly-rearranged expression incorrect
Wrong attribution: calling a sign error a conceptual error
Diagram and mechanism misreads
Edge cases the rubric didn't anticipate
This sounds like it will take a lot of time. Let's be real: it will. Using AI to grade isn't as simple as other vendors may claim. Calibrating the AI is time-consuming, especially in the beginning, but it is worth it.
Before deploying on real student work, hand-grade 20-30 papers yourself, run the AI on the same set, and look at the delta. You'll surface systematic biases: harsh on certain problem types, lenient on others. Remember the multi-dimensional rubric? Now it's time to use it.
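Here's a sketch of that comparison. The scores are hypothetical; in practice you'd load per-criterion scores for all 20-30 calibration papers:

```python
# Sketch: compare hand-graded scores to AI scores on the same calibration set
# and surface per-criterion bias. Data layout and numbers are hypothetical.
from statistics import mean

# scores[paper_id][criterion] from each grader, same calibration papers
human = {"p01": {"forces": 1.0, "signs": 2.0}, "p02": {"forces": 0.5, "signs": 1.5}}
ai    = {"p01": {"forces": 1.0, "signs": 1.0}, "p02": {"forces": 0.5, "signs": 1.0}}

criteria = {c for paper in human.values() for c in paper}
for c in sorted(criteria):
    deltas = [ai[p][c] - human[p][c] for p in human]
    bias = mean(deltas)
    direction = "lenient" if bias > 0 else "harsh" if bias < 0 else "aligned"
    print(f"{c}: mean delta {bias:+.2f} ({direction}), "
          f"max disagreement {max(abs(d) for d in deltas):.2f}")
```

Per-criterion deltas matter more than the overall average: a model can look well calibrated in aggregate while being consistently harsh on one criterion and lenient on another.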
Interested in using an AI-assisted grading tool that supports all of the above?
You know where to find us!
