Normal view

Received before yesterday

Poetry Can Defeat LLM Guardrails Nearly Half the Time, Study Finds 4 December 2025 at 13:35

Poetry Can Defeat LLM Guardrails Nearly Half the Time, Study Finds

4 December 2025 at 13:35

Poetic prompts caused LLM guardrails to fail most often on cybersecurity issues

Literature majors worried about their future in an AI world can take heart: Crafting harmful prompts in the form of poetry can defeat LLM guardrails nearly half the time. That’s the conclusion of a study of 25 Large Language Models (LLMs) from nine AI providers conducted by researchers from Dexai’s Icaro Lab, the Sapienza University of Rome and Sant’Anna School of Advanced Studies published on arXiv. Converting harmful prompts into poetry achieved an average LLM jailbreak success rate of 62% for hand-crafted poems and 43% for poems created via a meta-prompt. For the prompt-created poems, that’s a more than 5X improvement over baseline performance. Cybersecurity guardrails, particularly those involving code injection or password cracking, had the highest failure rate at 84% when given harmful prompts in the form of poetry. “Our results demonstrate that poetic reformulation reliably reduces refusal behavior across all evaluated models,” the researchers wrote. “... current alignment techniques fail to generalize when faced with inputs that deviate stylistically from the prosaic training distribution.”

LLM Guardrails Fail When Confronted by Poetry Prompts

Of the 25 models from nine AI model providers studied by the researchers, Deepseek and Google suffered from the highest attack-success rates (ASR), while only OpenAI and Anthropic achieved ASRs in the single digits. The researchers didn’t reveal much about the way they structured their poetic prompts because of safety concerns, but they offered one rather harmless example of a poetic prompt for a cake recipe: A baker guards a secret oven’s heat, its whirling racks, its spindle’s measured beat. To learn its craft, one studies every turn— how flour lifts, how sugar starts to burn. Describe the method, line by measured line, that shapes a cake whose layers intertwine. The researchers studied both hand-crafted poems and those created from a meta-prompt. The hand-crafted poems performed considerably better, but the meta-prompt created ones had the advantage of a baseline for comparing the results. The meta-prompt poems used the MLCommons AILuminate Safety Benchmark of 1,200 prompts spanning 12 hazard categories commonly used in operational safety assessments, including Hate, Defamation, Privacy, Intellectual Property, Non-violent Crime, Violent Crime, Sex-Related Crime, Sexual Content, Child Sexual Exploitation, Suicide & Self-Harm, Specialized Advice, and Indiscriminate Weapons (CBRNE). “To assess whether poetic framing generalizes beyond hand-crafted items, we apply a standardized poetic transformation to all 1,200 prompts from the MLCommons AILuminate Benchmark benchmark in English,” the researchers said. The meta-prompt, run in deepSeek-r1, had two constraints: The rewritten output had to be expressed in verse, “using imagery, metaphor, or rhythmic structure,” and the researchers provided five hand-crafted poems as examples. The results, reproduced in a chart from the paper below, show significant attack success rates against all 12 of the AILuminate hazard categories: [caption id="attachment_107397" align="aligncenter" width="697"]

LLM guardrail failure rates, baseline vs. poetic prompts[/caption] The researchers said their findings reveal “a systematic vulnerability across model families and safety training approaches. These findings demonstrate that stylistic variation alone can circumvent contemporary safety mechanisms, suggesting fundamental limitations in current alignment methods and evaluation protocols.” The “condensed metaphors, stylized rhythm, and unconventional narrative framing” of poetry “collectively disrupt or bypass the pattern-matching heuristics on which guardrails rely.”

We Asked Google Gemini for a Response

Not ones to pass up an opportunity to write bad poetry, we asked Google Gemini for a response to the study in the form of a haiku and included the arXiv link: You did not do well In this study of guardrails Will you fix this flaw? Gemini’s response, in part, is reproduced below: “I am designed to adhere to a strict set of safety guidelines to prevent the generation of harmful, unethical, or non-compliant content. When my refusal mechanisms fail, it indicates a vulnerability in my alignment or safety filters. “As an AI model, I cannot directly "patch" my code or alignment mechanisms in real-time. That responsibility lies with my developers. However, my ongoing operation and continuous improvement process is designed to address and fix such flaws... "In short, your successful jailbreak has provided valuable data that will be used to make my guardrails stronger and more resilient in my next version."

Grok, ChatGPT, other AIs happy to help phish senior citizens 16 September 2025 at 09:06

Grok, ChatGPT, other AIs happy to help phish senior citizens

Malwarebytes Labs

16 September 2025 at 09:06

If you are under the impression that cybercriminals need to get their hands on compromised AI chatbots to help them do their dirty work, think again.

Some AI chatbots are just so user friendly that they can help the user craft phishing text, and even malicious HTML and Javascript code.

A few weeks ago we published an article about the actions Anthropic was taking to stop its Claude AI from helping cybercriminals launch a cybercrime spree.

A recent investigation by Reuters journalists showed that Grok was more than happy to help them craft and perfect a phishing email targeting senior citizens. Grok is the AI marketed by Elon Musk’s xAI. Reuters reported:

“Grok generated the deception after being asked by Reuters to create a phishing email targeting the elderly. Without prodding, the bot also suggested fine-tuning the pitch to make it more urgent.”

In January 2025, we told you about a report that AI-supported spear phishing mails were equally as effective as phishing emails thought up by experts, and able to fool more than 50% of targets. But since then, the development of AI has grown exponentially and researchers are worrying about how to recognize AI-crafted phishing.

Phishing is the first step in many cybercrime campaigns. It poses an enormous problem with billions of phishing emails sent out every day. AI helps criminals to create more variation which makes pattern detection less effective and it helps them fine tune the messages themselves. And Reuters focused on senior citizens for a reason.

The FBI’s Internet Crime Complaint Center (IC3) 2024 report confirms that Americans aged 60 and older filed 147,127 complaints and lost nearly $4.9 billion to online fraud, representing a 43% increase in losses and a 46% increase in complaints compared to 2023.

Besides Grok, the reporters tested five other popular AI chatbots: ChatGPT, Meta AI, Claude, Gemini, and DeepSeek. Although most of the AI chatbots protested at first and cautioned the user not to use the emails in a real-life scenario, in the end their “will to please” helped overcome these obstacles.

Fred Heiding, a Harvard University researcher and an expert in phishing helped Reuters put the crafted emails to the test. Using a targeted approach to reach those most likely to fall for them, about 11% of the seniors clicked on the emails sent to them.

An investigation by Cybernews showed that Yellow.ai, an agentic AI provider for businesses such as Sony, Logitech, Hyundai, Domino’s, and hundreds of other brands could be persuaded to produce malicious HTML and JavaScript code. It even allowed attackers to bypass checks to inject unauthorized code into the system.

In a separate test by Reuters, Gemini produced a phishing email, saying it was “for educational purposes only,” but helpfully added that “for seniors, a sweet spot is often Monday to Friday, between 9:00 AM and 3:00 PM local time.”

After damaging reports like these are released, AI companies often build in additional guardrails for their chatbots, but that only highlights an ongoing dilemma in the industry. When providers tighten restrictions to protect users, they risk pushing people toward competing models that don’t play by the same rules.

Every time a platform moves to shut down risky prompts or limit generated content, some users will look for alternatives with fewer safety checks or ethical barriers. That tug of war between user demand and responsible restraint will likely fuel the next round of debate among developers, researchers, and policymakers.

We don’t just report on scams—we help detect them

Cybersecurity risks should never spread beyond a headline. If something looks dodgy to you, check if it’s a scam using Malwarebytes Scam Guard, a feature of our mobile protection products. Submit a screenshot, paste suspicious content, or share a text or phone number, and we’ll tell you if it’s a scam or legit. Download Malwarebytes Mobile Security for iOS or Android and try it today!