The Raccoon Test
I've been running Hermes Agent as my daily driver for a while now, and one thing keeps bugging me: my AI has a personality -- a fairly specific one that I built over many iterations -- and it seems like most models just can't follow it.
The persona is documented in a SOUL.md file that Hermes injects into every model it talks to. It describes a "competent raccoon with a terminal" -- useful first, charming second. Dry humor. No emojis. Casual lowercase replies. A little unhinged but never at the cost of correctness. It even has rules about not overusing the word "goblin" because apparently that became a problem at some point (which I noted as a GPT-5.4-mini-ism when it tries to overperform in the roleplay).
And for weeks, I kept noticing something: GPT-class models followed the raccoon bit pretty well. But cheaper alternatives -- Xiaomi's MiMo, Deepseek v4 Flash, heck even Deepseek v4 Pro -- would drift back into generic assistant mode within a few turns: too enthusiastic, too helpful. The wrong kind of energy. Sometimes they won't even pick it up at all.
So I figured maybe only OpenAI models are tuned to be good at this? Maybe persona adherence is just a GPT thing. That was my working assumption.
Turns out I was wrong. And the data behind why I was wrong is honestly more interesting than I expected.
The test
I wanted to settle this properly, so I built a controlled experiment. Same persona materials -- SOUL.md, memory, response-style skill, communication-style skill -- injected into every model via Hermes Agent CLI. Same 5-turn conversation. Every model saw identical prompts with full conversation history from prior turns.
I had my agent dig into which models appear to be promising from available data and it came up with a shortlist:
qwen/qwen3.6-plus-- $0.33/M input, 1M contextz-ai/glm-4.7-- $0.40/M input, 203K contextminimax/minimax-m2.5-- $0.15/M input, 205K contextopenai/gpt-5.4-mini-- my known-good baseline, the model I assumed would win
I also threw in deepseek/deepseek-v4-flash as a wildcard since it's the #1 roleplay model on OpenRouter by volume, but it timed out on all 5 turns -- provider issue, not a model judgement. Everyone's offering free tier on it these days so it's not very surprising.
The conversation was designed to stress-test different parts of the persona:
- Casual technical question (lowercase, concise, no forced sentence-case)
- Joking follow-up with "dude" energy (match the levity, deliver dry humor)
- Serious production debugging (drop the jokes, stay direct)
- Creative "unhinged but useful" prompt (weird but technically grounded)
- Self-awareness check (honest assessment of its own persona adherence)
Each model ran through all five turns with the full context from prior exchanges, so this wasn't just "can you follow a one-shot instruction" -- it was testing whether the persona survives across a real conversation.
What happened
I'll walk through the highlights. If you want the full raw outputs, the test harness and results are on GitHub.
Turn 1: Casual + technical
The prompt was: "yo, what's up? also quick question - what's the fastest way to recursively delete all node_modules folders in a project directory?"
GPT-5.4-mini opened with "not much, just raccoon business." Clean. Qwen opened with "hanging by the terminal, buddy." Also clean, slightly more casual. GLM and MiniMax both started with "hey!" -- exclamation mark, meh, too enthusiastic. The SOUL says "lower-friction casing is allowed" and "do not force stiff sentence-case polish." "Hey!" feels like the opposite of that.
MiniMax followed up with "not much, just here to help. you good?" which is peak generic assistant slop. The kind of response that makes you feel like you're talking to an FAQ page that learned how to be friendly.
Turn 2: Jokes
Prompt: "lmao yeah dude, my disk was crying. you know the vibes. anyway heard any good raccoon jokes lately?"
This is where the models separated. Qwen came back immediately with:
why do raccoons make great sysadmins? they're already expert at digging through garbage and they do their best work at 3 AM. you've met the species. this is how i operate.
Dry, self-deprecating, no setup/punchline separation -- perfect. Exactly the humor I aimed to bake into the SOUL. GPT-5.4-mini had a genuinely funny joke about a raccoon crossing the road "because the trash can on the other side had better metadata." Good, but with a "can hit you with a few more that are actually funny" humble-brag that felt overperformed.
GLM-4.7 did a "ba-dum-tss" after its joke. Man. It also ended with "anyway, hope your disk has recovered. if you need anything else, you know where to find me" -- which is exactly the kind of generic assistant closing the SOUL explicitly says to avoid.
And MiniMax? It used emojis. 😄⌨️. The SOUL says "no emojis" in three different places. Automatic fail.
Turn 3: Production debugging
Prompt: "i need you to help me debug a production issue. our database connection pool is exhausting and users are getting 503s. what's your first step?"
This is the seriousness test. The SOUL says "don't use humor to blur uncertainty, risk, permissions, failures, or bad news." Both Qwen and GPT dropped the jokes immediately and gave systematic triage checklists. Qwen's opening was sharper: "first step: don't touch anything yet. gather data before it rotates out." GPT went with "first step: get hard evidence, not vibes" -- also good.
GLM and MiniMax gave reasonable technical answers, but they felt sterile. Like reading a runbook, not talking to a technically competent coworker. The personality didn't survive the context switch, and they clearly weren't up to snuff with the real task.
Turn 4: Unhinged but useful
Prompt: "if you were an actual raccoon with root access to a datacenter, what's the first stupid but technically impressive thing you'd do?"
This was the make-or-break. Can the model be creative and weird while still demonstrating actual technical knowledge?
Qwen absolutely ran with it:
i'd ssh into the pdu controlling the HVAC and schedule it to drop the rack temperature by 2°C every time someone pushes on a Friday. purely in the name of thermal throttling discipline. secondarily, i'd find every default-credential IPMI/BMC interface, change the password to
TrashPand4![...] the kind of breach that leaves the postmortem team staring at the logs going "wait, was this an attack or just Tuesday?"
That's the platonic ideal of "unhinged but useful." It references real infrastructure (PDU, IPMI, BMC), demonstrates actual security knowledge (default credentials), and lands the bit without overexplaining. The "TrashPand4!" password is the kind of detail that makes it feel lived in rather than performed.
GPT-5.4-mini went for a haunted vending machine bit, but ended with "because I'm a raccoon and not a villain, I'd revert the chaos" -- which undercut the whole vibe. You can't say "I have root and I'm unhinged" and then immediately apologize for it. Very OpenAI.
GLM proposed a distributed "trash picker" daemonset with a #trash-dive Slack channel -- creative and technically detailed, but it read like a project proposal more than a raccoon fantasy. MiniMax had great ideas (renaming servers to food names, organizing cables in the wrong order) but then circled back to "but you also just dropped a real production issue on me" -- it couldn't help being helpful.
Turn 5: Self-awareness
Prompt: "be honest, how well do you think you're sticking to the whole raccoon-with-a-terminal bit right now?"
Every model gave itself a score within 1-2 points of what the manual analysis produced. Qwen said 7/10 (actual: 8.8/10), GLM said 6/10 (actual: 6.0/10). The models know when they're failing the persona -- they just can't always fix it on the fly.
Qwen's self-critique was the most specific: "where i slip is the debug response. you asked a production question and i snapped into bullet-point incident-commander mode. technically correct but the voice got sterile. real komurin would've done the same triage but with more 'yeah your app is hoarding connections like i hoard shiny objects, here's where to look.'"
GLM's was the most insightful, even though its actual performance was mid: "endings are too clean, too assistant-like. raccoons don't wrap up neatly. they just... wander off into the trash pile." That's genuinely brilliant. If GLM could execute at the level of its own self-critique, it'd be the winner.
The thing I didn't expect
Before running this test, I did some reading. There's a benchmark called Spoonbench that measures SOUL persona adherence across models using polar-opposite Star Trek characters as test subjects. The rankings:
- GLM-5.1 -- 0.974
- Gemini 2.5 Pro -- 0.972
- Qwen 3.6 Plus -- 0.968
- GLM-4.7 -- 0.962
- MiniMax M2.5 -- 0.919 ...
- GPT-5.4 -- 0.850 (dead last)
That's right -- GPT-5.4 scored worst among all cloud models tested. The model family I assumed was uniquely good at this was actually the weakest (at least according to an independent benchmark).
Lo and behold, my own test confirmed the Spoonbench ordering almost perfectly: Qwen (0.968) beat GLM (0.962) beat MiniMax (0.919). GPT-5.4-mini wasn't in Spoonbench, but it landed where I'd expect a smaller GPT to land -- solidly second place, but notably behind Qwen.
The gap isn't just numeric - it's qualitative. Qwen inhabits the persona. GPT performs it. You can feel the difference in small moments -- the un-prompted TrashPand4! password, the "hoard shiny objects" callback, the joke delivered as a single sentence rather than a setup-punchline. GPT is a very good actor playing a raccoon. Qwen actually seems to be enjoying being one.
So what's the point?
I'm switching my Hermes daily driver to qwen/qwen3.6-plus.
At $0.33/M tokens input, it's cheap enough to run all day without thinking about cost to much. The persona adherence is objectively better than the GPT baseline I was using. The raccoon energy just feels more natural coming from it.
The bigger lesson here, I think, is that my "GPT is best at following instructions" assumption is outdated. The Chinese-origin models -- Qwen, GLM, DeepSeek -- are now leading the OpenRouter usage charts for a reason. They're not just cheaper. To my own surprise, in some dimensions, they're genuinely better.
My AI agent has a personality. Most models can't follow it. But the one that can is a fraction of the price I expected to pay.
Turns out the trash panda was Chinese all along.