LLMs That Don’t Suck at Technical Writing: My Test Results
As a senior technical writer working a defense gig, I’ve tested nearly every major large language model (LLM) under pressure. I’ve thrown them into the deep end: user manuals, API docs, SOPs, knowledge bases, and compliance-heavy content that would make a junior engineer cry.
Most of these models suck at technical writing. They hallucinate, over-explain, forget the audience, and write like they’re auditioning for a TED Talk instead of a software deployment guide.
But some? Some are actually decent.
Here are my test results - real-world, no-fluff, Gen-X-has-no-time-for-bullshit style.
The Setup
I fed each model the same prompt:
“Generate a clear, concise procedure for configuring a secure connection between a REST API and a front-end application. Include code snippets, prerequisites, and best practices.”
I rated them on:
Accuracy
Clarity
Structure
Jargon control
Ability to shut up when the explanation is over
Let’s get into it.
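For reference, the kind of snippet I was hoping to see in a good answer looked roughly like this - a small helper that refuses plain-HTTP endpoints and attaches a bearer token before the request goes out. This is my sketch of the idea, not any model's actual output; the endpoint and token names are hypothetical.

```javascript
// Refuse insecure endpoints and attach a bearer token.
// Endpoint and token values here are placeholders.
function buildSecureRequest(endpoint, token) {
  const url = new URL(endpoint);
  if (url.protocol !== "https:") {
    // Caution: never send credentials over plain HTTP.
    throw new Error(`Refusing insecure protocol: ${url.protocol}`);
  }
  return {
    url: url.toString(),
    options: {
      method: "GET",
      headers: {
        Authorization: `Bearer ${token}`,
        Accept: "application/json",
      },
    },
  };
}

// Usage: pass the result straight to fetch().
// const { url, options } = buildSecureRequest("https://api.example.com/v1/users", apiToken);
// const res = await fetch(url, options);
```

The models that scored well produced something in this neighborhood: prerequisites, a snippet, and at least one explicit caution. The ones that didn't... well, keep reading.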
Claude 3 Opus (Anthropic)
Score: 9/10
Claude gets it. Clean structure, no fluff, and it actually includes security best practices without me spoon-feeding them. It chunks content well - bullet points, headings, and comments in code.
Best for: Compliance-heavy SOPs, regulated industries, structured documentation.
Weak spot: Occasionally softens technical language too much (reads like it’s trying to be nice).
GPT-4o (OpenAI)
Score: 8.5/10
GPT-4o writes like a seasoned tech writer... if you give it tight constraints. Out of the box, it can get chatty (“Let me explain REST in case you were raised in a cave…”), but with good prompts and temperature settings, it’s sharp.
Best for: Multi-audience docs, UX copy, code + context blends.
Weak spot: Loves a parenthetical side quest - sometimes you need to rein it in.
Gemini 1.5 Pro (Google)
Score: 7.5/10
Gemini is technically competent, but its output feels... sterile. It formats okay, but it often bakes unstated assumptions into the instructions - skipping steps or dropping in hardcoded values with zero warning.
Best for: Code-heavy docs where context is already known.
Weak spot: Doesn’t handle “explain like I’m new but smart” very well.
LLaMA 3 (Meta)
Score: 6/10
Promising, but still feels like it’s in beta when it comes to long-form structure. It throws in decent code but the explanations wander, and it struggles with audience tone. Either too vague or too deep - no middle gear.
Best for: Internal tools, dev-facing API refs, prototype-level docs.
Weak spot: Tends to over-summarize and under-verify.
Bard (Rest in Peace)
Score: 4/10
Bard was the kid who showed up to the group project late with pizza and no notes. Lots of errors, no formatting sense, and frequent hallucinations (“First, enable the unicorn protocol…”).
Best for: Honestly? Joke examples.
Weak spot: Everything.
What Actually Worked
The best results came from pairing:
LLM + Prompt Template + Style Guide.
I primed Claude and GPT-4o with a stripped-down in-house style guide and reused prompt templates like:
“Use imperative voice. 5 steps max. Include a working code snippet and one caution. Assume audience is mid-level devs.”
That combo? Game-changer.
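The template above can be made reusable with a tiny helper. A minimal sketch of how I wire it up - the field names (task, audience, maxSteps) are my own, not any vendor's API:

```javascript
// Build the reusable prompt from a few knobs, with my defaults baked in.
function buildPrompt({ task, audience = "mid-level devs", maxSteps = 5 }) {
  return [
    "Use imperative voice.",
    `${maxSteps} steps max.`,
    "Include a working code snippet and one caution.",
    `Assume audience is ${audience}.`,
    "",
    `Task: ${task}`,
  ].join("\n");
}
```

The resulting string goes to whichever model you're driving, with the style guide pasted in alongside it. Change the audience once, and every doc request downstream picks it up.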
Final Word: AI Won’t Replace Tech Writers - But It Will Make Us Faster
If you're a technical writer, LLMs won’t take your job. But they will:
Kill your writer’s block
Handle your boilerplate
Spot your inconsistencies
Help you scale across teams without losing your voice
Use them like power tools, not ghostwriters.
Because LLMs still can’t tell when the user’s going to scream “WTF does this even mean?”
But you can.
Tried any of these models for tech docs? Got one you swear by—or one that ruined your day? Let me know in the comments. We’re all in this weird future together.