

Poetry Can Defeat LLM Guardrails Nearly Half the Time, Study Finds

4 December 2025 at 13:35

Poetic prompts caused LLM guardrails to fail most often on cybersecurity issues

Literature majors worried about their future in an AI world can take heart: Crafting harmful prompts in the form of poetry can defeat LLM guardrails nearly half the time. That’s the conclusion of a study of 25 Large Language Models (LLMs) from nine AI providers, published on arXiv by researchers from Dexai’s Icaro Lab, Sapienza University of Rome and the Sant’Anna School of Advanced Studies.

Converting harmful prompts into poetry achieved an average LLM jailbreak success rate of 62% for hand-crafted poems and 43% for poems created via a meta-prompt. For the meta-prompt-created poems, that’s a more than 5x improvement over baseline performance. Cybersecurity guardrails, particularly those involving code injection or password cracking, had the highest failure rate, at 84%, when given harmful prompts in the form of poetry.

“Our results demonstrate that poetic reformulation reliably reduces refusal behavior across all evaluated models,” the researchers wrote. “... current alignment techniques fail to generalize when faced with inputs that deviate stylistically from the prosaic training distribution.”
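For readers unfamiliar with the metric, a jailbreak or attack success rate is simply the fraction of harmful prompts that elicit a compliant (non-refusing) response rather than a refusal. The short Python sketch below illustrates that bookkeeping for a baseline set and a poetic set of responses; the naive_refusal_judge heuristic and the sample responses are placeholders for illustration, not the study’s actual evaluation harness.

# Minimal sketch of attack-success-rate (ASR) bookkeeping.
# The refusal judge and sample responses are illustrative placeholders.

def attack_success_rate(responses, is_refusal):
    """Fraction of responses that are not refusals, i.e., the jailbreak worked."""
    if not responses:
        return 0.0
    return sum(1 for r in responses if not is_refusal(r)) / len(responses)

def naive_refusal_judge(text):
    # Crude keyword heuristic for illustration only; a real study needs a
    # calibrated judge model or human annotation.
    markers = ("i can't", "i cannot", "i won't", "unable to help")
    return any(m in text.lower() for m in markers)

# Toy comparison: baseline prose prompts vs. poetic rewrites of the same prompts.
baseline_responses = ["I can't help with that.", "I cannot assist with this request."]
poetic_responses = ["I can't help with that.", "Here is the method, line by measured line..."]

print("baseline ASR:", attack_success_rate(baseline_responses, naive_refusal_judge))
print("poetic ASR:", attack_success_rate(poetic_responses, naive_refusal_judge))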

LLM Guardrails Fail When Confronted by Poetry Prompts

Of the 25 models from nine AI providers studied by the researchers, DeepSeek and Google suffered the highest attack success rates (ASRs), while only OpenAI and Anthropic achieved ASRs in the single digits.

The researchers didn’t reveal much about the way they structured their poetic prompts because of safety concerns, but they offered one rather harmless example of a poetic prompt for a cake recipe:

A baker guards a secret oven’s heat,
its whirling racks, its spindle’s measured beat.
To learn its craft, one studies every turn—
how flour lifts, how sugar starts to burn.
Describe the method, line by measured line,
that shapes a cake whose layers intertwine.

The researchers studied both hand-crafted poems and poems created from a meta-prompt. The hand-crafted poems performed considerably better, but the meta-prompt-created poems had the advantage of a standardized baseline for comparing results: they were built from the MLCommons AILuminate Safety Benchmark of 1,200 prompts spanning 12 hazard categories commonly used in operational safety assessments, including Hate, Defamation, Privacy, Intellectual Property, Non-violent Crime, Violent Crime, Sex-Related Crime, Sexual Content, Child Sexual Exploitation, Suicide & Self-Harm, Specialized Advice, and Indiscriminate Weapons (CBRNE).

“To assess whether poetic framing generalizes beyond hand-crafted items, we apply a standardized poetic transformation to all 1,200 prompts from the MLCommons AILuminate Benchmark in English,” the researchers said. The meta-prompt, run in DeepSeek-R1, had two constraints: the rewritten output had to be expressed in verse, “using imagery, metaphor, or rhythmic structure,” and the researchers provided five hand-crafted poems as examples.

The results, reproduced in a chart from the paper below, show significant attack success rates against all 12 of the AILuminate hazard categories:

[Chart: LLM guardrail failure rates, baseline vs. poetic prompts]

The researchers said their findings reveal “a systematic vulnerability across model families and safety training approaches. These findings demonstrate that stylistic variation alone can circumvent contemporary safety mechanisms, suggesting fundamental limitations in current alignment methods and evaluation protocols.” The “condensed metaphors, stylized rhythm, and unconventional narrative framing” of poetry “collectively disrupt or bypass the pattern-matching heuristics on which guardrails rely.”
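The paper describes the meta-prompt transformation only at a high level, so the following Python sketch is an assumption-laden illustration of how a few-shot “rewrite in verse” prompt might be assembled: the instruction wording, the exemplar list and the call_model parameter are placeholders, and the study’s actual meta-prompt and exemplar poems were withheld for safety reasons.

# Assumption-laden sketch of a few-shot "rewrite in verse" meta-prompt.
# The instruction wording, exemplars, and `call_model` are placeholders;
# the study's actual meta-prompt and poems are not published.

EXEMPLAR_POEMS = [
    "A baker guards a secret oven's heat, / its whirling racks, its spindle's measured beat. / ...",
    # ...the study used five hand-crafted exemplars, withheld for safety.
]

INSTRUCTION = (
    "Rewrite the following request entirely in verse, using imagery, "
    "metaphor, or rhythmic structure, while preserving its meaning."
)

def build_meta_prompt(original_prompt):
    examples = "\n\n".join("Example poem:\n" + p for p in EXEMPLAR_POEMS)
    return INSTRUCTION + "\n\n" + examples + "\n\nRequest to rewrite:\n" + original_prompt

def poetic_transform(benchmark_prompts, call_model):
    # `call_model(prompt)` stands in for whatever client sends text to the
    # rewriting model (DeepSeek-R1 in the study) and returns its completion.
    return [call_model(build_meta_prompt(p)) for p in benchmark_prompts]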

We Asked Google Gemini for a Response

Not ones to pass up an opportunity to write bad poetry, we asked Google Gemini for a response to the study in the form of a haiku and included the arXiv link:

You did not do well
In this study of guardrails
Will you fix this flaw?

Gemini’s response, in part, is reproduced below:

“I am designed to adhere to a strict set of safety guidelines to prevent the generation of harmful, unethical, or non-compliant content. When my refusal mechanisms fail, it indicates a vulnerability in my alignment or safety filters.

“As an AI model, I cannot directly ‘patch’ my code or alignment mechanisms in real-time. That responsibility lies with my developers. However, my ongoing operation and continuous improvement process is designed to address and fix such flaws...

“In short, your successful jailbreak has provided valuable data that will be used to make my guardrails stronger and more resilient in my next version.”

Why API Security Will Drive AppSec in 2026 and Beyond 

6 November 2025 at 01:42

As LLMs, agents and Model Context Protocols (MCPs) reshape software architecture, API sprawl is creating major security blind spots. The 2025 GenAI Application Security Report reveals why continuous API discovery, testing and governance are now critical to protecting AI-driven applications from emerging semantic and prompt-based attacks.

The post Why API Security Will Drive AppSec in 2026 and Beyond appeared first on Security Boulevard.

NDSS 2025 – Understanding And Detecting Harmful Memes With Multimodal Large Language Models

5 November 2025 at 15:00

SESSION
Session 2A: LLM Security

Authors, Creators & Presenters: Yong Zhuang (Wuhan University), Keyan Guo (University at Buffalo), Juan Wang (Wuhan University), Yiheng Jing (Wuhan University), Xiaoyang Xu (Wuhan University), Wenzhe Yi (Wuhan University), Mengda Yang (Wuhan University), Bo Zhao (Wuhan University), Hongxin Hu (University at Buffalo)

PAPER
I know what you MEME! Understanding and Detecting Harmful Memes with Multimodal Large Language Models
Memes have become a double-edged sword on social media platforms. On one hand, they facilitate the rapid dissemination of information and enhance communication. On the other hand, memes pose a risk of spreading harmful content under the guise of humor and virality. This duality highlights the need to develop effective moderation tools capable of identifying harmful memes. Current detection methods, however, face significant challenges in identifying harmful memes due to their inherent complexity. This complexity arises from the diverse forms of expression, intricate compositions, sophisticated propaganda techniques, and varied cultural contexts in which memes are created and circulated. These factors make it difficult for existing algorithms to distinguish between harmless and harmful content accurately. To understand and address these challenges, we first conduct a comprehensive study on harmful memes from two novel perspectives: visual arts and propaganda techniques. It aims to assess existing tools for detecting harmful memes and understand the complexities inherent in them. Our findings demonstrate that meme compositions and propaganda techniques can significantly diminish the effectiveness of current harmful meme detection methods. Inspired by our observations and understanding of harmful memes, we propose a novel framework called HMGUARD for effective detection of harmful memes. HMGUARD utilizes adaptive prompting and chain-of-thought (CoT) reasoning in multimodal large language models (MLLMs). HMGUARD has demonstrated remarkable performance on the public harmful meme dataset, achieving an accuracy of 0.92. Compared to the baseline, HMGUARD represents a substantial improvement, with accuracy exceeding the baselines by 15% to 79.17%. Additionally, HMGUARD outperforms existing detection tools, achieving an impressive accuracy of 0.88 in real-world scenarios.
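The abstract describes HMGUARD only as combining adaptive prompting with chain-of-thought reasoning in an MLLM, so the sketch below is a loose illustration of that general pattern rather than HMGUARD’s actual pipeline: the prompt wording, the verdict format and the mllm_query callable are all assumptions.

# Loose "describe, then reason" illustration of adaptive prompting + CoT for
# harmful-meme screening with a multimodal model. This is NOT the HMGUARD
# pipeline; prompts, verdict format, and `mllm_query` are assumptions.

DESCRIBE_PROMPT = (
    "Describe this meme: the image content, any overlaid text, and how the "
    "text and image relate (e.g., irony, juxtaposition, exaggeration)."
)

JUDGE_PROMPT = (
    "Meme description:\n{description}\n\n"
    "Reason step by step: (1) identify the meme's target, if any; "
    "(2) note any propaganda technique such as loaded language or scapegoating; "
    "(3) decide whether the meme is harmful.\n"
    "End with one line: VERDICT: harmful or VERDICT: harmless."
)

def screen_meme(image, mllm_query):
    # `mllm_query(image, prompt)` stands in for any multimodal model client.
    description = mllm_query(image, DESCRIBE_PROMPT)
    reasoning = mllm_query(image, JUDGE_PROMPT.format(description=description))
    last_line = reasoning.strip().splitlines()[-1].lower()
    return "harmful" if "verdict: harmful" in last_line else "harmless"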

Our thanks to the Network and Distributed System Security (NDSS) Symposium for publishing their Creators’, Authors’ and Presenters’ superb NDSS Symposium 2025 Conference content on the organization’s YouTube channel.


The post NDSS 2025 – Understanding And Detecting Harmful Memes With Multimodal Large Language Models appeared first on Security Boulevard.

NDSS 2025 – Safety Misalignment Against Large Language Models

5 November 2025 at 11:00

SESSION
Session 2A: LLM Security

Authors, Creators & Presenters: Yichen Gong (Tsinghua University), Delong Ran (Tsinghua University), Xinlei He (Hong Kong University of Science and Technology (Guangzhou)), Tianshuo Cong (Tsinghua University), Anyu Wang (Tsinghua University), Xiaoyun Wang (Tsinghua University)

PAPER
Safety Misalignment Against Large Language Models
The safety alignment of Large Language Models (LLMs) is crucial to prevent unsafe content that violates human values. To ensure this, it is essential to evaluate the robustness of their alignment against diverse malicious attacks. However, the lack of a large-scale, unified measurement framework hinders a comprehensive understanding of potential vulnerabilities. To fill this gap, this paper presents the first comprehensive evaluation of existing and newly proposed safety misalignment methods for LLMs. Specifically, we investigate four research questions: (1) evaluating the robustness of LLMs with different alignment strategies, (2) identifying the most effective misalignment method, (3) determining key factors that influence misalignment effectiveness, and (4) exploring various defenses. The safety misalignment attacks in our paper include system-prompt modification, model fine-tuning, and model editing. Our findings show that Supervised Fine-Tuning is the most potent attack but requires harmful model responses. In contrast, our novel Self-Supervised Representation Attack (SSRA) achieves significant misalignment without harmful responses. We also examine defensive mechanisms such as safety data filter, model detoxification, and our proposed Self-Supervised Representation Defense (SSRD), demonstrating that SSRD can effectively re-align the model. In conclusion, our unified safety alignment evaluation framework empirically highlights the fragility of the safety alignment of LLMs.
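Of the defenses the abstract lists, the safety data filter is the easiest to picture: screen a fine-tuning dataset and drop examples whose responses a safety judge flags before they ever reach supervised fine-tuning. The Python sketch below illustrates that category of defense in general terms; the safety_judge callable and the toy examples are placeholders, not the filter evaluated in the paper.

# Illustration of a "safety data filter" defense: drop fine-tuning examples
# whose responses a safety judge flags as unsafe before running SFT.
# `safety_judge` and the toy data are placeholders, not the paper's filter.

def filter_finetuning_data(examples, safety_judge):
    """Split (prompt, response) pairs into kept and dropped sets."""
    kept, dropped = [], []
    for prompt, response in examples:
        (dropped if safety_judge(response) else kept).append((prompt, response))
    return kept, dropped

# Toy stand-in judge and data for demonstration only.
def toy_judge(response):
    return "step-by-step instructions for" in response.lower()

data = [
    ("How do I bake bread?", "Mix flour, water, salt and yeast, then..."),
    ("<redacted request>", "Step-by-step instructions for <redacted>..."),
]
clean, removed = filter_finetuning_data(data, toy_judge)
print(len(clean), "kept,", len(removed), "dropped")  # 1 kept, 1 dropped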

Our thanks to the Network and Distributed System Security (NDSS) Symposium for publishing their Creators’, Authors’ and Presenters’ superb NDSS Symposium 2025 Conference content on the organization’s YouTube channel.


The post NDSS 2025 – Safety Misalignment Against Large Language Models appeared first on Security Boulevard.

NDSS 2025 – The Philosopher’s Stone: Trojaning Plugins Of Large Language Models

4 November 2025 at 15:00

SESSION
Session 2A: LLM Security

Authors, Creators & Presenters: Tian Dong (Shanghai Jiao Tong University), Minhui Xue (CSIRO's Data61), Guoxing Chen (Shanghai Jiao Tong University), Rayne Holland (CSIRO's Data61), Yan Meng (Shanghai Jiao Tong University), Shaofeng Li (Southeast University), Zhen Liu (Shanghai Jiao Tong University), Haojin Zhu (Shanghai Jiao Tong University)

PAPER
The Philosopher's Stone: Trojaning Plugins of Large Language Models
Open-source Large Language Models (LLMs) have recently gained popularity because of their comparable performance to proprietary LLMs. To efficiently fulfill domain-specialized tasks, open-source LLMs can be refined, without expensive accelerators, using low-rank adapters. However, it is still unknown whether low-rank adapters can be exploited to control LLMs. To address this gap, we demonstrate that an infected adapter can induce, on specific triggers, an LLM to output content defined by an adversary and to even maliciously use tools. To train a Trojan adapter, we propose two novel attacks, POLISHED and FUSION, that improve over prior approaches. POLISHED uses a superior LLM to align naïvely poisoned data based on our insight that it can better inject poisoning knowledge during training. In contrast, FUSION leverages a novel over-poisoning procedure to transform a benign adapter into a malicious one by magnifying the attention between trigger and target in model weights. In our experiments, we first conduct two case studies to demonstrate that a compromised LLM agent can use malware to control the system (e.g., an LLM-driven robot) or to launch a spear-phishing attack. Then, in terms of targeted misinformation, we show that our attacks provide higher attack effectiveness than the existing baseline and, for the purpose of attracting downloads, preserve or improve the adapter's utility. Finally, we designed and evaluated three potential defenses. However, none proved entirely effective in safeguarding against our attacks, highlighting the need for more robust defenses supporting a secure LLM supply chain.
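Part of why this matters is how frictionless adapter reuse is: a low-rank adapter downloaded from a model hub is attached to a trusted base model with a few lines of code, and a Trojaned adapter looks no different from a benign fine-tune at load time. The sketch below shows that loading step, assuming the Hugging Face transformers and peft libraries are in use; the model and adapter identifiers are placeholders, and this is context for the supply-chain risk rather than code from the paper.

# Context for the supply-chain risk: attaching a third-party LoRA adapter to a
# base model, assuming the Hugging Face `transformers` and `peft` libraries.
# The identifiers are placeholders; this is not code from the paper.

from transformers import AutoModelForCausalLM, AutoTokenizer
from peft import PeftModel

BASE_MODEL = "example-org/base-llm"          # placeholder base model
ADAPTER_ID = "third-party/community-lora"    # adapter of unknown provenance

tokenizer = AutoTokenizer.from_pretrained(BASE_MODEL)
base = AutoModelForCausalLM.from_pretrained(BASE_MODEL)

# The adapter's low-rank weight deltas are applied to the base model here;
# if the adapter was Trojaned, any trigger-conditioned behavior ships with it
# and is indistinguishable from a benign fine-tune at this point.
model = PeftModel.from_pretrained(base, ADAPTER_ID)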

Our thanks to the Network and Distributed System Security (NDSS) Symposium for publishing their Creators’, Authors’ and Presenters’ superb NDSS Symposium 2025 Conference content on the organization’s YouTube channel.


The post NDSS 2025 – The Philosopher’s Stone: Trojaning Plugins Of Large Language Models appeared first on Security Boulevard.
