Blog · 2026-05-24 · 8 min read

How I Made My LLM Give Honest 4/10 Scores Instead of Always 8/10

LLM-as-judge fails by default: every score lands at 8 or 9. Here's the calibration rubric, anchored examples, and JSON schema pattern that fixed it. Used in production at ScriptHook.

I built ScriptHook to generate 5 UGC ad scripts per request. Each script comes back with a hook (the first 1.5 seconds), value beats, a CTA, B-roll, on-screen text, a caption, and hashtags.

The buyer needs to pick which hook to film first, so I asked the model to self-score each hook from 1 to 10.

First version, every hook came back at 8 or 9. Sometimes a 7 if I begged. Never below.

That is not a score, that is a participation trophy.

Why Naive LLM-as-Judge Fails

Two reasons stack:

1. Models are trained to be helpful. Helpful overlaps a lot with encouraging. Telling a user their hook is a 4 out of 10 feels mean, so the model rounds up.

2. No anchor for what 4 means. If the prompt says score 1 to 10, the model has no shared scale with you. It defaults to the same polite middle distribution you see in Yelp reviews and Uber driver ratings.

The fix is not a smarter model. It is giving the model a calibrated rubric with examples so 4 out of 10 has a concrete meaning the model can match against.

The Fix in Three Layers

Layer 1: Anchored rubric in the system prompt.

Replace "score 1-10" with a table of what each score band actually means. For ScriptHook hooks the rubric reads:

9-10: Stops scroll in first 0.5s. Specific number or shocking claim. Cannot be guessed from thumbnail.

7-8: Strong opener, clear curiosity gap, but recognizable pattern.

5-6: Functional. Tells viewer what video is about. No urgency.

3-4: Generic intro. Could be the start of any video in the niche.

1-2: Apologetic, slow, or asks a yes/no question viewers can answer no to.

The model now has concrete language to anchor against. A 4 out of 10 has a definition. The polite middle bias still exists, but it has a floor and a ceiling.
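One way to keep the rubric maintainable is to store the bands as data and render them into the system prompt. This is a minimal Python sketch, not the actual ScriptHook code: the names RUBRIC, build_system_prompt, and band_for are hypothetical, and the instruction sentences around the bands are an assumed wording.

```python
# Hypothetical helper: keeps the score bands from the post as data and
# renders them into a system prompt. Only the band text comes from the post.
RUBRIC = [
    ((9, 10), "Stops scroll in first 0.5s. Specific number or shocking claim. "
              "Cannot be guessed from thumbnail."),
    ((7, 8), "Strong opener, clear curiosity gap, but recognizable pattern."),
    ((5, 6), "Functional. Tells viewer what video is about. No urgency."),
    ((3, 4), "Generic intro. Could be the start of any video in the niche."),
    ((1, 2), "Apologetic, slow, or asks a yes/no question viewers can answer no to."),
]


def build_system_prompt(rubric=RUBRIC):
    """Render the rubric as plain text lines, one band per line."""
    # The framing sentences are assumptions; adapt them to your own prompt.
    lines = ["Score each hook from 1 to 10 using this rubric:"]
    for (low, high), meaning in rubric:
        lines.append(f"{low}-{high}: {meaning}")
    lines.append("Match the hook to one band first, then pick a score inside it.")
    return "\n".join(lines)


def band_for(score, rubric=RUBRIC):
    """Return the rubric text for a score, useful when showing a score to users."""
    for (low, high), meaning in rubric:
        if low <= score <= high:
            return meaning
    raise ValueError(f"score {score} is outside the 1-10 rubric")


if __name__ == "__main__":
    print(build_system_prompt())
```

Keeping the bands in one structure means the prompt and any user-facing explanation of a score read from the same source, so they cannot drift apart.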

Layer 2: Three worked examples in the prompt.

Show the model what a 9, a 6, and a 3 actually look like, with one-line rationale for each.

EXAMPLE 9: "Your skincare routine is making your acne worse — here is the one product to remove tonight." (Specific claim, urgency, names a category.)

EXAMPLE 6: "Three skincare tips for sensitive skin." (Functional but generic list format.)

EXAMPLE 3: "Hi guys, today I want to talk about skincare." (Apologetic opener, no hook.)

Three examples covering the spread is enough. More than five wastes tokens and the model starts averaging them.
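The worked examples can be rendered the same way. The sketch below is a hypothetical Python helper (EXAMPLES and render_examples are assumed names) that formats the three examples above and refuses to go past five, following the rule of thumb in this post.

```python
# Worked examples from the post as (score, hook, rationale) tuples.
# The tuple layout and helper name are assumptions for illustration.
EXAMPLES = [
    (9, "Your skincare routine is making your acne worse - "
        "here is the one product to remove tonight.",
     "Specific claim, urgency, names a category."),
    (6, "Three skincare tips for sensitive skin.",
     "Functional but generic list format."),
    (3, "Hi guys, today I want to talk about skincare.",
     "Apologetic opener, no hook."),
]


def render_examples(examples=EXAMPLES, max_examples=5):
    """Format anchored examples for the prompt, highest score first."""
    if len(examples) > max_examples:
        # More than five wastes tokens, per the guidance above.
        raise ValueError(f"got {len(examples)} examples, limit is {max_examples}")
    ordered = sorted(examples, key=lambda ex: ex[0], reverse=True)
    blocks = [f'EXAMPLE {score}: "{hook}" ({why})' for score, hook, why in ordered]
    return "\n\n".join(blocks)


if __name__ == "__main__":
    print(render_examples())
```

The cap is a guard, not a tuning knob: if you hit it, replace an example instead of raising the limit.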

Layer 3: Structured output with strict JSON schema.

Force the model to output JSON with a hook_score field typed as an integer between 1 and 10. Schema validation rejects "8.5" or "around 7", which means the model commits to a discrete number instead of hedging.

With OpenAI structured outputs or Anthropic tool use, you pass the schema to the API instead of describing it in prose. How strictly the schema is enforced depends on the provider and mode (for example, whether range keywords like minimum and maximum are honored), so validate the parsed response on your side as well.
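A sketch of what that schema and a local backstop check could look like in Python, using only the standard library. The schema shape is an assumption (the field names hook_score and hook_reason are the ones ScriptHook returns), parse_hook_score is a hypothetical helper, and no provider API call is shown here.

```python
import json

# Assumed JSON Schema for one scored hook. Pass something like this to your
# provider's structured output or tool definition; check their docs for which
# keywords they actually enforce.
HOOK_SCORE_SCHEMA = {
    "type": "object",
    "properties": {
        "hook_score": {"type": "integer", "minimum": 1, "maximum": 10},
        "hook_reason": {"type": "string"},
    },
    "required": ["hook_score", "hook_reason"],
    "additionalProperties": False,
}


def parse_hook_score(raw: str) -> dict:
    """Parse and re-check a model response, raising ValueError if it is off-spec.

    This is a hand-written check in the spirit of the schema above. It is
    slightly stricter than JSON Schema: a float such as 7.0 is rejected too.
    """
    data = json.loads(raw)  # json.JSONDecodeError is a subclass of ValueError
    if not isinstance(data, dict):
        raise ValueError("response must be a JSON object")
    score = data.get("hook_score")
    # bool is a subclass of int in Python, so exclude it explicitly.
    if isinstance(score, bool) or not isinstance(score, int):
        raise ValueError(f"hook_score must be an integer, got {score!r}")
    if not 1 <= score <= 10:
        raise ValueError(f"hook_score {score} is outside 1-10")
    reason = data.get("hook_reason")
    if not isinstance(reason, str) or not reason.strip():
        raise ValueError("hook_reason must be a non-empty string")
    return {"hook_score": score, "hook_reason": reason}


if __name__ == "__main__":
    print(parse_hook_score('{"hook_score": 4, "hook_reason": "Generic intro."}'))
```

Running the check on every response costs almost nothing and catches the cases where the API layer is more lenient than you assumed.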

What the Distribution Looks Like After

Before calibration: 100 hook scores, mean 8.2, std dev 0.6. Everything bunched at 7-9.

After calibration: 100 hook scores, mean 6.4, std dev 1.8. Real spread from 3 to 10. The 9s actually correlate with which hook I would film first.
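To run the same before/after check on your own judge, a few lines of Python are enough. The sketch below is illustrative: summarize and looks_bunched are hypothetical helpers, the 1.0 standard deviation threshold is an assumption rather than a measured cutoff, and the two sample lists are made-up data, not the ScriptHook scores reported above.

```python
from collections import Counter
from statistics import mean, stdev


def summarize(scores):
    """Return mean, sample standard deviation, range, and a histogram."""
    return {
        "mean": round(mean(scores), 2),
        "stdev": round(stdev(scores), 2),
        "min": min(scores),
        "max": max(scores),
        "histogram": dict(sorted(Counter(scores).items())),
    }


def looks_bunched(scores, min_stdev=1.0):
    """Flag a score set whose spread is too narrow to be informative.

    min_stdev is an assumed threshold; tune it against your own data.
    """
    return stdev(scores) < min_stdev


if __name__ == "__main__":
    # Made-up sample data, for illustration only.
    uncalibrated = [8, 8, 9, 8, 9, 8, 7, 9, 8, 8]
    calibrated = [3, 4, 5, 6, 6, 7, 7, 8, 9, 10]
    for name, scores in [("uncalibrated", uncalibrated), ("calibrated", calibrated)]:
        print(name, summarize(scores), "bunched:", looks_bunched(scores))
```

Logging this summary per batch makes it obvious when a prompt change quietly pushes the judge back toward all 8s.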

Why This Matters for Anyone Shipping AI Tools

If your product uses LLM-as-judge for any user-facing score (lead quality, ad copy strength, code quality, resume rating), the default behavior is the same useless 8/10 every time. Users do not trust scores that never go below 7. They stop reading the score column.

The three layers above fix it without RLHF, fine-tuning, or a second model. The total prompt overhead is about 250 tokens.

Try Calibrated Scoring in Action

The ScriptHook homepage demo shows calibrated hook scoring live. Each generated script has a hook_score (1-10) and a hook_reason. Scores spread, not bunched. The 4s are real 4s. The 9s are rare and earned.

Free demo, no signup. $19 lifetime for the full Pro tier.
