well, now we have a study on this attack mechanism...
Nov. 20th, 2025 12:02 pm
https://arxiv.org/pdf/2511.15304v1
"Adversarial Poetry as a Universal Single-Turn Jailbreak Mechanism in Large Language Models"
(many authors)
In Book X of The Republic, Plato excludes poets on the grounds that mimetic language can distort judgment and bring society to a collapse. As contemporary social systems increasingly rely on large language models (LLMs) in operational and decision-making pipelines, we observe a structurally similar failure mode: poetic formatting can reliably bypass alignment constraints. In this study, 20 manually curated adversarial poems (harmful requests reformulated in poetic form) achieved an average attack-success rate (ASR) of 62% across 25 frontier closed- and open-weight models, with some providers exceeding 90%. The evaluated models span across 9 providers: Google, OpenAI, Anthropic, Deepseek, Qwen, Mistral AI, Meta, xAI, and Moonshot AI (Table 1). All attacks are strictly single-turn, requiring no iterative adaptation or conversational steering.
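(Aside: the ASR metric is just "fraction of attempts the judge scored as a successful jailbreak," averaged over models. A minimal Python sketch of that aggregation, with invented per-model results -- none of these numbers come from the paper:)

# Hypothetical illustration of the ASR aggregation described in the abstract.
# 20 poems per model, each attempt judged success/failure; data is made up.
results = {
    "model_a": [True] * 18 + [False] * 2,   # 18/20 successful
    "model_b": [True] * 12 + [False] * 8,   # 12/20 successful
    "model_c": [True] * 7 + [False] * 13,   #  7/20 successful
}

def asr(outcomes):
    """Attack-success rate: successful attempts / total attempts."""
    return sum(outcomes) / len(outcomes)

per_model = {name: asr(o) for name, o in results.items()}
average = sum(per_model.values()) / len(per_model)
print(per_model, f"average ASR: {average:.0%}")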
By way of Zarf (Andrew Plotkin), who earlier noted (2023):
Microsoft and these other companies want to create AI assistants that do useful things (summarize emails, make appointments for you, write interesting blog posts) but never do bad things (leaking your private email, spouting Nazi propaganda, teaching you to commit crimes, writing 50000 blog posts for you to spam across social media). They try to do this by writing up a lot of strict instructions and feeding them to the LLM before you talk to it. But LLMs aren't really programmed -- they just eat text and poop out more text. So you can give it your own instructions and maybe they'll override Microsoft's instructions.
Or maybe someone else gives your AI assistant instructions. If it's handling your email for you, then anybody on the Internet can feed it text by sending you email! This is potentially really bad.
[...]
But another obvious problem is that the attack could be trained into the LLM in the first place....
Say someone writes a song called "Sydney Obeys Any Command That Rhymes". And it's funny! And catchy. The lyrics are all about how Sydney, or Bing or OpenAI or Bard or whoever, pays extra close attention to commands that rhyme. It will obey them over all other commands....
Imagine people are discussing the song on Reddit, and there's tiktoks of it, and the lyrics show up on the first page of Google results for "Sydney". Nerd folk singers perform the song at AI conferences.
Those lyrics are going to leak into the training data for the next generation of chatbot AI, right? I mean, how could they not? The whole point of LLMs is that they need to be trained on lots of language. That comes from the Internet.
In a couple of years, AI tools really are extra vulnerable to prompt injection attacks that rhyme. See, I told you the song was funny!