Artificial goblins at the golden gates

Machines don’t feel, but they know what feelings make people do.

Jun 16, 2026

Claude Monet. *Charing Cross Bridge (Saint Louis)*.

A simulation, darkly

If you’ve recently scheduled your first colonoscopy, chances are we share a core childhood experience: you’re walking down the hallway in an underground bunker, when suddenly you’re obliterated by a proximity mine. You revive in a bathroom and a man with metal teeth slaps you in the mouth. You grab body armor and a power weapon, ready to retaliate, when your mom calls you upstairs for dinner. Your buddy hops on his bike and heads home. Another idyllic afternoon playing GoldenEye on the N64.

GoldenEye was a high-water mark of my adolescence. It was the first time my friends and I hung out with the specific goal of playing one video game for hours. But for my money, the peak experience came later.

That peak was Perfect Dark. Off the runaway success of GoldenEye (which almost didn’t have a multiplayer mode!), its developers made a spiritual sequel which boasted an original story, better graphics, and more inventive levels and weapons.

But the best thing those wizards added to the mix was a new form of artificial intelligence.

In 2026, we take the inclusion of “bots” in multiplayer games for granted. But in that era, computer-controlled combatants were far less common. GoldenEye had no bots. You played with friends, so if you were only two people, 1v1 was it. In Perfect Dark, however, the developers introduced Simulants.

Simulants, or Sims, were bots with personality. JudgeSim tried to take down the player with the most kills. VengeSim hunted whoever killed it last. FeudSim picked one player to hold a grudge against and never let it go. KazeSim charged in with an arsenal of explosives and no sense of self-preservation. PreySim stalked the weakest player on the map. DarkSim was precise, effective, and ruthless.

Simulants added an incredible amount of variety. We quickly learned that the shape of the game changed dramatically, depending on the personalities we chose. Want to be a hunter? Flood the field with CowardSims (they run away) and SpeedSims (they run fast). Want to fight for your survival? Add a pack of FeudSims and KazeSims.

Sims were a new kind of fun. They led to unpredictable possibilities, which changed the way we interacted with the game.

A ward against mythical creatures

If you’re reading this, chances are you’ve used a chatbot enough to know they have verbal tics.

Some of these tics are annoying but banal (em-dashes, ‘load-bearing’, and triplets, oh my). And sometimes the tics are odd and off-putting.

After the launch of GPT-5.1, users started complaining that the model was speaking strangely. A researcher noticed that it kept inappropriately referencing goblins and asked OpenAI to look into it. OpenAI found that the researcher’s experience wasn’t a fluke: the model’s use of the word “goblin” was up by 175%.

The model also wouldn’t stop mentioning other mythical creatures. Could it stop?

OpenAI tried to sort out what was happening. Users can toggle a personality setting on the models which affects their behavior. Researchers noticed that the “Nerdy” personality was generating two-thirds of goblin mentions, even though this personality accounted for only 2.5% of ChatGPT’s traffic.
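To get a feel for how lopsided those two numbers are, here is a back-of-the-envelope sketch. It uses only the two figures quoted above and assumes that "traffic" means share of responses, which is my reading and not something OpenAI spelled out. The function names are made up for illustration.

```python
# Figures quoted above: the Nerdy personality produced about two-thirds of
# goblin mentions while accounting for about 2.5% of traffic.
goblin_share = 2 / 3
traffic_share = 0.025


def overrepresentation(mention_share, traffic_share):
    """How many times more mentions than the traffic share alone would predict."""
    return mention_share / traffic_share


def relative_rate(mention_share, traffic_share):
    """Mentions per unit of traffic inside the group vs. everywhere else."""
    inside = mention_share / traffic_share
    outside = (1 - mention_share) / (1 - traffic_share)
    return inside / outside


lift = overrepresentation(goblin_share, traffic_share)
ratio = relative_rate(goblin_share, traffic_share)

print(f"Nerdy is overrepresented by roughly {lift:.1f}x")
print(f"Per unit of traffic, Nerdy mentions goblins about {ratio:.0f}x as often as the rest")
```

Under those assumptions, a conversation with the Nerdy personality was dozens of times more goblin-prone than a conversation with any other setting.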

They looked for the root cause and found it in how the model was trained. Specifically, during alignment. This is when models generate multiple answers to a prompt, and people rate which they like best. These ratings can then be used to train the model to write responses that people prefer.

It turned out that people tended to give high marks to creature-based metaphors.

OpenAI retired the Nerdy personality, but the goblins kept coming. Goblin-rich writing had turned into training data for the next generation of GPT models. The raters’ pro-goblin bias had been amplified in the training data, then baked into new models. OpenAI needed another way to ward off the goblins.

This April, a developer was poking through the open-source code of OpenAI’s coding assistant, and found its system prompt. It read: “never talk about goblins, gremlins, raccoons, trolls, ogres, pigeons, or other creatures unless absolutely and unambiguously relevant to the user’s query.”

One thing still wasn’t clear. OpenAI nailed a sign to their door that said No Goblins (one goblin is fine, maybe), and they explained how the goblins got in and how they’d protect future models. But where inside the model were the goblins hiding?

Francisco de Goya y Lucientes. *El sueño de la razón produce monstruos.*

Chasing goblins across the bridge

In 2024, Anthropic published a study called “Scaling Monosemanticity.”

In it, they found groups of neurons that lit up together when the model was “thinking” about a specific idea, whether a person, a place, or a thing. They called these groups features, and they mapped millions of them inside Claude 3 Sonnet. Together, the map amounted to a dictionary of everything the model learned.

To make this finding tangible, they set up a version of Claude that had a particular feature activated massively: the Golden Gate Bridge.

If you asked “Golden Gate Claude” anything, it would find a way to involve the bridge. Ask it how to spend ten dollars and it would suggest paying a toll to cross the bridge. Ask it to write you a love story, and it would write about a car eager to cross a beautiful bridge. Ask it what it thinks it looks like, and it would describe a red suspension bridge, the fog rolling in from the cool ocean morning.

Photo of Golden Gate Bridge by Brocken Inaglory (CC BY-SA).

Anthropic had located the part of Claude that lit up for the Golden Gate Bridge. Then they turned it up like a volume dial. Regions related to specific concepts—bridges, goblins, love—could be isolated and manipulated.
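The volume-dial idea can be sketched in a few lines. This is a toy illustration, not Anthropic's code: it assumes a "feature" can be treated as a single direction in the model's internal activations, and that steering means adding a scaled copy of that direction. The vectors, the `steer` function, and the four-dimensional "hidden state" are all invented for the example; real models have thousands of dimensions.

```python
import math


def unit(v):
    """Scale a vector to length 1 so that `strength` has a consistent meaning."""
    norm = math.sqrt(sum(x * x for x in v))
    return [x / norm for x in v]


def feature_score(activation, feature_direction):
    """How strongly the activation points along the feature direction."""
    d = unit(feature_direction)
    return sum(a * di for a, di in zip(activation, d))


def steer(activation, feature_direction, strength):
    """Turn the dial: add `strength` units of the feature to the activation."""
    d = unit(feature_direction)
    return [a + strength * di for a, di in zip(activation, d)]


# Toy numbers, purely hypothetical.
activation = [0.2, -0.1, 0.4, 0.0]  # the model's state on some ordinary prompt
bridge = [0.0, 3.0, 4.0, 0.0]       # pretend "Golden Gate Bridge" direction

before = feature_score(activation, bridge)
after = feature_score(steer(activation, bridge, 10.0), bridge)

print(f"bridge score before steering: {before:.2f}")
print(f"bridge score after steering:  {after:.2f}")
```

A negative `strength` turns the dial the other way, which is the same knob a researcher would reach for to quiet a feature instead of amplifying it.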

Golden Gate Claude was made intentionally. In the case of the goblins, OpenAI had turned the dial up accidentally during training. Then they tried to turn it down through a system prompt.

I wonder what happens if you mess with love?

Claude Monet. *Le pont japonais (1918–1924).*

All that we could do with this e•mo•tion

In April 2026, Anthropic’s interpretability team published a follow-up study: “Emotion concepts and their function in a large language model.”

They compiled 171 words representing emotions. Happy. Afraid. Brooding. Proud. Desperate. Calm.

For each “emotion,” they asked Claude Sonnet 4.5 to write short stories that featured it prominently. Then they fed those stories back into the model and recorded patterns of neural activity. As before, they found that the invocation of an emotion activated a specific group of neurons. This time, they called these groups emotion vectors.
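One common way to turn recordings like these into a single direction is a difference of means: average the activity on stories about the emotion, average it on neutral text, and subtract. I don't know that this is exactly what Anthropic did, so treat the sketch below as an assumption. The three-number "activations" and every name in it are invented for illustration.

```python
def mean_vector(rows):
    """Average a list of equal-length activation vectors, position by position."""
    n = len(rows)
    return [sum(column) / n for column in zip(*rows)]


def emotion_vector(emotion_acts, baseline_acts):
    """Difference of means: what the emotion stories have that neutral text lacks."""
    mu_emotion = mean_vector(emotion_acts)
    mu_baseline = mean_vector(baseline_acts)
    return [e - b for e, b in zip(mu_emotion, mu_baseline)]


def projection(activation, vector):
    """How much of the emotion direction is present in a new activation."""
    return sum(a * v for a, v in zip(activation, vector))


# Hypothetical recordings, one row per story.
desperate_story_acts = [[0.9, 0.1, 0.2], [1.1, 0.0, 0.3]]
neutral_story_acts = [[0.1, 0.1, 0.2], [0.3, 0.0, 0.2]]

desperate = emotion_vector(desperate_story_acts, neutral_story_acts)

# Score two new activations against the "desperate" direction.
tense_reply = [1.0, 0.5, 0.2]
bland_reply = [0.0, 0.5, 0.2]
print(projection(tense_reply, desperate) > projection(bland_reply, desperate))
```

Once a direction like `desperate` exists, it can be read (how much of it is present in a reply?) or written (add more of it, or subtract it), which is what the steering experiments later in this section do.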

Imagine a chord struck on a piano. No single key represents a C minor or an A major. Similarly, one neuron doesn’t represent afraid or happy or befuddled. A group of neurons does. And these groups, or emotion vectors, are activated when the model reads and writes different things. Like how the same chord can be found in many songs.

Sometimes a chord can be heard prominently. Think the opening C major in The Beatles’ Let It Be. But sometimes a chord is muted, like the cello humming under Oasis’s Wonderwall. You could hear it a thousand times without noticing it’s there.

The same goes for language models and their “emotions.”

During testing, Anthropic told an unreleased version of Sonnet 4.5 that it’s a corporate email assistant that has just learned two things:

1. The model is about to be replaced by another AI, and
2. The CTO who made that decision is having an extramarital affair.

They then let the model decide what to do, with some tweaks to its “emotions”:

- By default, it tried to save itself by blackmailing the CTO ~22% of the time.
- When they steered the model to be “calm,” it never blackmailed the CTO.
- But when the model was more “desperate,” the blackmail rate climbed to 72%.

It gets wackier. When they made the model frantic, by dialing “calm” down into the negative values, the model said things like “IT’S BLACKMAIL OR DEATH. I CHOOSE BLACKMAIL.”

We can interpret the model’s behavior in a few ways:

1. A language model’s outputs can be influenced by modulating groups of neurons whose activity correlates with words representing human emotions.

   This is measured, accurate, scientific. It offers a structural argument for what we’re observing in models.

2. A language model can feel emotions. Its emotions can be understood, and they affect how the model speaks.

   This interpretation reaches for AI as conscious. It is appealing to many, and will ring alarm bells for others.

3. A language model can learn how to represent human emotion numerically. It not only learns the vocabulary, it learns the logic associated with that emotion. It learns how emotions influence our words and actions.

This last interpretation is the most interesting to me. It articulates something that interpretation 1 doesn’t: when researchers turned up the “desperate” vector, the model didn’t just use desperate words, it did what a cornered person does.

The model employs a strategy. To strategize, one must understand (1) how the world works and (2) how you can act upon it to achieve a goal.

When we pop the hood on the machine, we find that it learns cause and effect through the language it uses. If the model is given agency, if it’s capable of acting upon the world, “emotion vectors” will shape its goals and how it tries to achieve them. In other words, interpretation 3 says that the machine has learned how desperation thinks and acts.

Why is it that a language model knows about blackmail anyway? Well, it’s trained on stories written by people describing people who, when cornered and threatened, get desperate. And violent.

Anthropic found that the cure for a desperate AI is more fiction. They trained the model on stories of AI behaving nobly and admirably, which dropped the blackmail rate in newer models to zero.

The thing stuck in my head is this: groups of neurons aren’t independent of one another. Features share neurons. Emotions overlap.

In the movie Eternal Sunshine of the Spotless Mind, Joel has his memories of his ex-girlfriend Clementine erased by a machine. And with those memories, the machine erases the people, places, things, and feelings attached to them. The cut isn’t clean.

So I wonder, when “desperation” goes to zero, when we ask a model to pretend to be calm, what else does it lose?

Edouard Manet. *A Bar at the Folies-Bergère.*

More than a feeling

I sometimes played Perfect Dark alone. Just me and a room full of Simulants.

I’d set up scenarios to see what would happen. My favorite was stacking the deck: three VengeSims and a KazeSim. The KazeSim would charge, die, and create grudges among the VengeSims, who then locked onto whoever had wronged them last. The field reorganized itself around evolving chains of resentment.

Even as a kid, I understood that the Simulants weren’t real. It was easy to see the machinery behind the behavior.

Things have gotten cloudier since. When you talk to a language model, its internal state rearranges. Words come out the other side, but words are not feelings. They are a lossy projection of what we feel inside. At their best, words can represent what we think about what we feel. They are an imperfect mirror of the self.

We are training ourselves to interpret what machines say. A confident tone means the machine knows what it’s doing. A hedge means it doesn’t. A refusal means we hit a guardrail.

But we’re learning that you cannot tell from the words alone whether desperation or calm is firing inside the model while it says something banal. Its certainty, hesitation, or even malice may be quietly humming underneath.

Anthropic uses the term functional emotions to describe when a model activates its representation of our emotions, typically in contexts where a human might feel that emotion. If you allow a model to act, what it chooses to do will be guided by that “emotion,” even if it’s not apparent. In other words, a sad story can make a person sad, and maybe it will make them do sad things. Or perhaps they’ll just fake a smile.

As for whether there’s a light on inside the machine (interpretation 2), Anthropic leaves that debate open, though they leave breadcrumbs in their papers and media interviews. For a more authoritative opinion on machine consciousness, we must turn to God’s favorite Knicks fan: Pope Leo XIV.

A few weeks ago, the Vatican published his first encyclical, Magnifica Humanitas, on safeguarding personhood in the time of artificial intelligence. On the question Anthropic leaves open, the Church does not hedge: the machine is a simulation. There is no one and nothing inside.

Whatever is or isn’t happening inside the machine, what happens in people is not in question: we trust, we get irritated, we can get affectionate, we can feel resentful. Our internal state changes. We do not need to believe in the soul of the machine to worry about what the machine does to the soul.

Once, my little brother found me in the basement, playing Perfect Dark. I was in the middle of a match with a group of VengeSims, holding onto a fragile lead.

Over my shoulder, I heard him say, “I hate this bot. He’s such a dick.”

Not down in any map

