
On the tedious and joyful process of writing 125 AI evals

written by Graham Knapp on 2025-10-30

I recently wrapped up a piece of work that involved writing 125 individual test cases, or 'evals', for an AI language model.

I can’t share the specific details, unfortunately, but the process itself was surprising and helped me understand LLMs a little better. It’s a strange task that’s one part coding, one part creative writing, and one part pure quality assurance.

Oh, go on then, I'll give you some details: these are evals for multimodal vision LLMs on computer-use tasks. Each eval asks the model to complete a short sequence of related actions in a single piece of software, guided by a short text prompt with a goal in mind. I wrote the evals in a custom-built environment that let me execute the task myself whilst writing the instructions alongside. I've heard multiple people recommend this approach over trying to fit a specific task into an off-the-shelf eval-building solution.
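To give a concrete flavour, here's a minimal sketch of what one eval record might look like in an environment like that. All the field names and the example task are invented for this post, not the actual schema or data I used:

```python
from dataclasses import dataclass, field

@dataclass
class ComputerUseEval:
    """One eval: a short, goal-directed task inside a single application."""
    eval_id: str
    prompt: str                    # the short text instruction given to the model
    app: str                       # the single piece of software the task lives in
    steps: list[str] = field(default_factory=list)  # the reference action sequence
    expected_output: str = ""      # an unambiguous success criterion

example = ComputerUseEval(
    eval_id="calendar-007",
    prompt="Decline the invitation to Friday's mandatory fun-day and propose Monday instead.",
    app="calendar",
    steps=[
        "Open the event titled 'Mandatory Fun-Day'",
        "Click 'Decline'",
        "Add the comment 'Could we synergise on Monday instead?'",
    ],
    expected_output="Event declined, with a comment proposing Monday.",
)
```

Writing the steps while actually performing them is what keeps the reference sequence honest.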

After writing 125 of them, I’ve formed some strong opinions on the process, what works, and what keeps me sane...

It’s a job of contradictions

The first surprise is that writing evals is not a walk in the park.

One moment, you are buried in the most mind-numbing, repetitive data entry imaginable. The next, you’re trying to craft a perfectly balanced, tricky prompt, and it feels like a genuinely creative puzzle.

You can't just brute-force it. I found you have to get into a flow state, a bit like a good coding session. When you're "in the zone," you can design and write a whole cluster of related test cases, but when you're not, every single one feels like a slog. Sometimes the only solution is to recognise that the tedium has won and take a break. Usually "the zone" sits just beyond the moment where you want to give up - you have to go through the wall.

The 'Goldilocks' level of complexity

One of the main challenges is getting the complexity right.

The sweet spot for this specific set of tasks in late 2025 seems to be around 2-5 steps. This is complex enough to be a meaningful test of reasoning and instruction following, but not so complex that you're just setting the model up to fail.
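To illustrate what I mean by the sweet spot, here are three invented prompts (none from my actual eval set), annotated with rough step counts:

```python
# Hypothetical prompts illustrating the 2-5 step sweet spot.
too_easy = "Open the settings menu."  # 1 step: tests almost nothing

goldilocks = (
    "Open the settings menu, switch the theme to dark, "
    "and increase the font size to 14pt."
)  # 3 steps: real instruction following, still gradeable

too_hard = (
    "Migrate every project to the new workspace, re-tag them by client, "
    "archive anything inactive since 2023, then draft a summary email "
    "with a per-client breakdown."
)  # ~10 steps: you're just setting the model up to fail
```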

How to stay motivated: humour and precision

Let's be honest: writing your 87th test case on a Tuesday afternoon can be a drag. I found two things that help me stay engaged:

First, look for opportunities for fun. I started embedding little jokes, puns, and wordplay into the test cases: playing with corporate double-speak, using absurd character names, or creating ridiculous scenarios. This kept me interested. It’s more fun to test a model’s ability to "write a professionally-worded email declining an invitation to a mandatory fun-day to synergise core competencies" than something generic.

I also found it helpful to draw on real-life examples with some emotional resonance, either positive or negative. That genuinely infuriating debugging session I had last year? The surprisingly wonderful feedback I got on a project five years ago? These events make for great, realistic test scenarios because they have a human texture that's hard to invent from scratch. They add variety to the training data as well as keeping me interested.

Second, be ruthlessly precise. This is the QA side of the brain taking over.

  1. On Input: Use copy and paste for any specific terms, names, or phrases. You have to be meticulous. A single typo or spelling variation invalidates the test, because you won't know whether the model failed the task, stumbled over your typo, or missed a step you forgot to write down.
  2. On Output: This is the most critical part. You must be completely unambiguous about the expected output format. Don't just say "summarise this." Say "Summarise this text into a single paragraph of no more than 50 words." If you want JSON, define the exact schema (see the sketch after this list). Any ambiguity in the "correct answer" makes the test case useless, or much harder to automate.
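To make the output point concrete, here's a minimal sketch of the kind of exact specification I mean. The schema and field names are illustrative, not taken from my actual evals:

```python
import json
import jsonschema  # third-party library: pip install jsonschema

# An unambiguous expected-output contract: if you want JSON, spell out the schema.
SUMMARY_SCHEMA = {
    "type": "object",
    "properties": {
        "summary": {"type": "string", "maxLength": 400},  # roughly 50 words of text
        "word_count": {"type": "integer", "maximum": 50},
    },
    "required": ["summary", "word_count"],
    "additionalProperties": False,
}

model_output = '{"summary": "The report recommends delaying the launch.", "word_count": 7}'

# Grading becomes a mechanical check instead of a judgment call:
# this raises jsonschema.ValidationError if the output breaks the contract.
jsonschema.validate(instance=json.loads(model_output), schema=SUMMARY_SCHEMA)
```

Once the expected output is pinned down this tightly, automating the pass/fail decision is trivial.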

Writing evals for LLMs is a strange discipline. It takes the logic of a software developer, the precision of a technical writer, and the creative spark of someone who needs to invent 125 different poems on a deadline. It's tedious, it's fun, and it's a long way from being automated.

ai