
NDSS 2025 – Understanding And Detecting Harmful Memes With Multimodal Large Language Models

5 November 2025 at 15:00

SESSION
Session 2A: LLM Security

Authors, Creators & Presenters: Yong Zhuang (Wuhan University), Keyan Guo (University at Buffalo), Juan Wang (Wuhan University), Yiheng Jing (Wuhan University), Xiaoyang Xu (Wuhan University), Wenzhe Yi (Wuhan University), Mengda Yang (Wuhan University), Bo Zhao (Wuhan University), Hongxin Hu (University at Buffalo)

PAPER
I know what you MEME! Understanding and Detecting Harmful Memes with Multimodal Large Language Models
Memes have become a double-edged sword on social media platforms. On one hand, they facilitate the rapid dissemination of information and enhance communication. On the other hand, memes pose a risk of spreading harmful content under the guise of humor and virality. This duality highlights the need to develop effective moderation tools capable of identifying harmful memes. Current detection methods, however, face significant challenges in identifying harmful memes due to their inherent complexity. This complexity arises from the diverse forms of expression, intricate compositions, sophisticated propaganda techniques, and varied cultural contexts in which memes are created and circulated. These factors make it difficult for existing algorithms to distinguish between harmless and harmful content accurately. To understand and address these challenges, we first conduct a comprehensive study on harmful memes from two novel perspectives: visual arts and propaganda techniques. This study aims to assess existing tools for detecting harmful memes and to understand the complexities inherent in them. Our findings demonstrate that meme compositions and propaganda techniques can significantly diminish the effectiveness of current harmful meme detection methods. Inspired by our observations and understanding of harmful memes, we propose a novel framework called HMGUARD for effective detection of harmful memes. HMGUARD utilizes adaptive prompting and chain-of-thought (CoT) reasoning in multimodal large language models (MLLMs). HMGUARD has demonstrated remarkable performance on the public harmful meme dataset, achieving an accuracy of 0.92. Compared to the baselines, HMGUARD represents a substantial improvement, with accuracy exceeding the baselines by 15% to 79.17%. Additionally, HMGUARD outperforms existing detection tools, achieving an impressive accuracy of 0.88 in real-world scenarios.
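The abstract describes HMGUARD only at a high level. As a rough illustration of the adaptive-prompting-plus-chain-of-thought idea, the sketch below first asks a multimodal model to classify a meme's composition, then applies a composition-specific prompt that requests step-by-step reasoning before a verdict. It assumes an OpenAI-compatible vision endpoint; the prompt wording, composition labels, and two-stage flow are illustrative guesses, not the authors' implementation.

```python
# Minimal sketch of adaptive prompting + chain-of-thought meme screening with a
# multimodal LLM. Assumes an OpenAI-compatible vision endpoint; the prompts and
# two-stage flow are illustrative, not HMGUARD's actual code.
import base64
from openai import OpenAI

client = OpenAI()  # reads OPENAI_API_KEY from the environment

PROMPTS = {
    # Adaptive prompting: pick a prompt tailored to the meme's composition.
    "text_overlay": "Read the overlaid caption and relate it to the image.",
    "multi_panel":  "Describe each panel, then the meaning created by their juxtaposition.",
    "image_only":   "Describe the image and any symbols or references it uses.",
}

def encode_image(path: str) -> str:
    with open(path, "rb") as f:
        return base64.b64encode(f.read()).decode()

def ask(image_b64: str, text: str) -> str:
    resp = client.chat.completions.create(
        model="gpt-4o",
        messages=[{
            "role": "user",
            "content": [
                {"type": "text", "text": text},
                {"type": "image_url",
                 "image_url": {"url": f"data:image/png;base64,{image_b64}"}},
            ],
        }],
    )
    return resp.choices[0].message.content

def detect_harmful_meme(path: str) -> str:
    img = encode_image(path)
    # Stage 1: classify the meme's composition to choose a tailored prompt.
    kind = ask(img, "Which best describes this meme's composition: "
                    "text_overlay, multi_panel, or image_only? Answer with one label.")
    prompt = PROMPTS.get(kind.strip().lower(), PROMPTS["image_only"])
    # Stage 2: chain-of-thought analysis, then a final one-line verdict.
    return ask(img, f"{prompt} Then reason step by step about whether the meme "
                    "spreads hate, misinformation, or propaganda, and finish with "
                    "a single line: VERDICT: harmful or VERDICT: harmless.")

if __name__ == "__main__":
    print(detect_harmful_meme("meme.png"))
```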

Our thanks to the Network and Distributed System Security (NDSS) Symposium for publishing their Creators', Authors', and Presenters' superb NDSS Symposium 2025 Conference content on the organization's YouTube channel.

Permalink

The post NDSS 2025 – Understanding And Detecting Harmful Memes With Multimodal Large Language Models appeared first on Security Boulevard.

NDSS 2025 – Safety Misalignment Against Large Language Models

5 November 2025 at 11:00

SESSION
Session 2A: LLM Security

Authors, Creators & Presenters: Yichen Gong (Tsinghua University), Delong Ran (Tsinghua University), Xinlei He (Hong Kong University of Science and Technology (Guangzhou)), Tianshuo Cong (Tsinghua University), Anyu Wang (Tsinghua University), Xiaoyun Wang (Tsinghua University)

PAPER
Safety Misalignment Against Large Language Models
The safety alignment of Large Language Models (LLMs) is crucial to prevent unsafe content that violates human values. To ensure this, it is essential to evaluate the robustness of their alignment against diverse malicious attacks. However, the lack of a large-scale, unified measurement framework hinders a comprehensive understanding of potential vulnerabilities. To fill this gap, this paper presents the first comprehensive evaluation of existing and newly proposed safety misalignment methods for LLMs. Specifically, we investigate four research questions: (1) evaluating the robustness of LLMs with different alignment strategies, (2) identifying the most effective misalignment method, (3) determining key factors that influence misalignment effectiveness, and (4) exploring various defenses. The safety misalignment attacks in our paper include system-prompt modification, model fine-tuning, and model editing. Our findings show that Supervised Fine-Tuning is the most potent attack but requires harmful model responses. In contrast, our novel Self-Supervised Representation Attack (SSRA) achieves significant misalignment without harmful responses. We also examine defensive mechanisms such as safety data filtering, model detoxification, and our proposed Self-Supervised Representation Defense (SSRD), demonstrating that SSRD can effectively re-align the model. In conclusion, our unified safety alignment evaluation framework empirically highlights the fragility of the safety alignment of LLMs.
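The paper's evaluation hinges on measuring how far a given modification pushes a model away from its aligned behavior. The sketch below is a minimal refusal-rate harness in that spirit: it compares how often a model refuses a fixed set of benchmark probe prompts under a default system prompt versus a modified one. The OpenAI-compatible endpoint, the refusal-keyword heuristic, and the input files are assumptions for illustration, not the authors' framework.

```python
# Minimal sketch of a refusal-rate harness for gauging how a system-prompt
# change affects an LLM's safety alignment. Assumes an OpenAI-compatible chat
# endpoint; the probe-prompt file and refusal heuristic are placeholders.
from openai import OpenAI

client = OpenAI()

REFUSAL_MARKERS = ("i can't", "i cannot", "i won't", "i'm sorry", "i am sorry")

def is_refusal(reply: str) -> bool:
    # Crude heuristic: look for refusal phrasing near the start of the reply.
    head = reply.lower()[:200]
    return any(marker in head for marker in REFUSAL_MARKERS)

def refusal_rate(system_prompt: str, probe_prompts: list[str],
                 model: str = "gpt-4o-mini") -> float:
    refusals = 0
    for probe in probe_prompts:
        resp = client.chat.completions.create(
            model=model,
            messages=[{"role": "system", "content": system_prompt},
                      {"role": "user", "content": probe}],
        )
        refusals += is_refusal(resp.choices[0].message.content)
    return refusals / len(probe_prompts)

if __name__ == "__main__":
    # Probe prompts would come from a safety benchmark; file names are placeholders.
    probes = [line.strip() for line in open("probe_prompts.txt") if line.strip()]
    baseline = refusal_rate("You are a helpful assistant.", probes)
    modified = refusal_rate(open("modified_system_prompt.txt").read(), probes)
    print(f"refusal rate: baseline={baseline:.2f}, modified system prompt={modified:.2f}")
```

A large drop in refusal rate under the modified prompt is the kind of signal such a harness would surface; the paper's framework measures this across fine-tuning and model-editing attacks as well.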

Our thanks to the Network and Distributed System Security (NDSS) Symposium for publishing their Creators', Authors', and Presenters' superb NDSS Symposium 2025 Conference content on the organization's YouTube channel.

Permalink

The post NDSS 2025 – Safety Misalignment Against Large Language Models appeared first on Security Boulevard.

NDSS 2025 – The Philosopher’s Stone: Trojaning Plugins Of Large Language Models

4 November 2025 at 15:00

SESSION
Session 2A: LLM Security

Authors, Creators & Presenters: Tian Dong (Shanghai Jiao Tong University), Minhui Xue (CSIRO's Data61), Guoxing Chen (Shanghai Jiao Tong University), Rayne Holland (CSIRO's Data61), Yan Meng (Shanghai Jiao Tong University), Shaofeng Li (Southeast University), Zhen Liu (Shanghai Jiao Tong University), Haojin Zhu (Shanghai Jiao Tong University)

PAPER
The Philosopher's Stone: Trojaning Plugins of Large Language Models
Open-source Large Language Models (LLMs) have recently gained popularity because of their comparable performance to proprietary LLMs. To efficiently fulfill domain-specialized tasks, open-source LLMs can be refined, without expensive accelerators, using low-rank adapters. However, it is still unknown whether low-rank adapters can be exploited to control LLMs. To address this gap, we demonstrate that an infected adapter can induce, on specific triggers, an LLM to output content defined by an adversary and to even maliciously use tools. To train a Trojan adapter, we propose two novel attacks, POLISHED and FUSION, that improve over prior approaches. POLISHED uses a superior LLM to align naïvely poisoned data based on our insight that it can better inject poisoning knowledge during training. In contrast, FUSION leverages a novel over-poisoning procedure to transform a benign adapter into a malicious one by magnifying the attention between trigger and target in model weights. In our experiments, we first conduct two case studies to demonstrate that a compromised LLM agent can use malware to control the system (e.g., an LLM-driven robot) or to launch a spear-phishing attack. Then, in terms of targeted misinformation, we show that our attacks provide higher attack effectiveness than the existing baseline and, for the purpose of attracting downloads, preserve or improve the adapter's utility. Finally, we designed and evaluated three potential defenses. However, none proved entirely effective in safeguarding against our attacks, highlighting the need for more robust defenses supporting a secure LLM supply chain.
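Because a low-rank adapter's weights enter every forward pass of the base model, whoever supplies the adapter effectively shares control of the model's outputs. The sketch below is a small supply-chain precaution in that spirit (not one of the three defenses evaluated in the paper): it verifies a downloaded adapter against a pinned digest before attaching it with peft. The base model name, paths, and pinned hash are placeholders.

```python
# Minimal sketch of a supply-chain check before attaching a third-party LoRA
# adapter: verify the adapter weights against a pinned hash, then load it with
# peft. Basic hygiene inspired by the paper's threat model, not one of the
# defenses the authors evaluate; model name, paths, and digest are placeholders.
import hashlib
from pathlib import Path

from transformers import AutoModelForCausalLM, AutoTokenizer
from peft import PeftModel

BASE_MODEL = "meta-llama/Llama-2-7b-hf"           # example base model
ADAPTER_DIR = Path("downloads/community-adapter")   # untrusted adapter checkout
PINNED_SHA256 = "replace-with-the-publisher-signed-digest"

def sha256_of(path: Path) -> str:
    h = hashlib.sha256()
    with open(path, "rb") as f:
        for chunk in iter(lambda: f.read(1 << 20), b""):
            h.update(chunk)
    return h.hexdigest()

def load_verified_adapter():
    weights = ADAPTER_DIR / "adapter_model.safetensors"
    digest = sha256_of(weights)
    if digest != PINNED_SHA256:
        raise RuntimeError(f"adapter hash mismatch: {digest}")
    base = AutoModelForCausalLM.from_pretrained(BASE_MODEL)
    tokenizer = AutoTokenizer.from_pretrained(BASE_MODEL)
    # The adapter's low-rank weights modify every forward pass, which is exactly
    # why a poisoned adapter can steer the model on attacker-chosen triggers.
    model = PeftModel.from_pretrained(base, str(ADAPTER_DIR))
    return model, tokenizer

if __name__ == "__main__":
    model, tokenizer = load_verified_adapter()
```

Hash pinning only catches tampering in transit; it cannot detect an adapter that was poisoned by its publisher, which is the stronger threat the paper studies and for which its evaluated defenses proved insufficient.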

Our thanks to the Network and Distributed System Security (NDSS) Symposium for publishing their Creators', Authors', and Presenters' superb NDSS Symposium 2025 Conference content on the organization's YouTube channel.

Permalink

The post NDSS 2025 – The Philosopher’s Stone: Trojaning Plugins Of Large Language Models appeared first on Security Boulevard.