Character-Level Perturbations Disrupt LLM Watermarks

Community Article Published January 31, 2026

Authors: Zhaoxi Zhang*, Xiaomei Zhang*, Yanjun Zhang, He Zhang, Shirui Pan, Bo Liu, Asif Gill, Leo Yu Zhang (Corresponding author) * Equal contribution

Paper link: Character-Level Perturbations Disrupt LLM Watermarks | Abstract
Code link: plll4zzx/CharacterRemoval4WM

Our paper “Character-Level Perturbations Disrupt LLM Watermarks” has been accepted to the Network and Distributed System Security (NDSS) Symposium 2026.

Large Language Model (LLM) watermarking, which embeds detectable signals during text generation, has been regarded as a promising solution for copyright protection, misuse prevention, and AI-generated content detection. However, a key challenge lies in accurately assessing the robustness of watermarking schemes. Current evaluations rely on watermark removal attacks, yet most existing attacks are suboptimal, leading to the misconception that successful removal always requires either a large perturbation budget or a powerful adversary.

In this work, we systematically investigate the robustness of LLM watermarking:

• We formalize the system model and define two realistic threat models with limited detector access.
• We analyze different perturbation types and demonstrate that character-level perturbations (e.g., typos, deletions, homoglyphs) achieve stronger removal performance by disrupting tokenization, allowing a single modification to affect multiple tokens.
• We propose a reference-detector-guided genetic algorithm to optimize perturbations, and design a compound character-level attack that effectively bypasses potential defenses.

Experiments on five representative watermarking schemes and two widely used LLMs consistently confirm the superiority of character-level perturbations. Our findings highlight critical vulnerabilities in current watermarking techniques and emphasize the urgent need for more robust mechanisms.

Reference reading list: plll4zzx/Awesome-LLM-Watermark: A collection list for Large Language Model (LLM) Watermark

Figure 1 Illustration of threat models in watermark removal.

Figure 2 Character-level perturbations have a larger attack range than token-level perturbations in watermark removal.
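To make the "larger attack range" idea concrete, here is a minimal sketch of how a single character deletion can force a tokenizer to re-segment a word and thereby replace more than one token. It assumes a toy greedy longest-match tokenizer and a tiny hypothetical vocabulary (`VOCAB`); real LLM tokenizers (e.g., BPE-based ones) behave differently in detail, and none of the names below come from the paper's code.

```python
# Toy illustration only: VOCAB and greedy_tokenize are hypothetical stand-ins
# for a real subword tokenizer, not the tokenizers used in the paper.
VOCAB = {"water", "mark", "term", "ark", " text"}


def greedy_tokenize(text, vocab):
    """Longest-match-first tokenizer that falls back to single characters."""
    max_len = max(len(v) for v in vocab)
    tokens = []
    i = 0
    while i < len(text):
        for size in range(min(max_len, len(text) - i), 0, -1):
            piece = text[i:i + size]
            # A single character is always accepted as a fallback token.
            if size == 1 or piece in vocab:
                tokens.append(piece)
                i += size
                break
    return tokens


def lost_tokens(original, perturbed):
    """Original tokens that no longer appear after the perturbation."""
    return [t for t in original if t not in perturbed]


original = greedy_tokenize("watermark text", VOCAB)
# Character-level edit: delete the single character "a" from "watermark".
char_edit = greedy_tokenize("wtermark text", VOCAB)
# Token-level edit: swap exactly one token ("mark") for another ("term").
token_edit = ["water", "term", " text"]

print(original)                          # ['water', 'mark', ' text']
print(char_edit)                         # ['w', 'term', 'ark', ' text']
print(lost_tokens(original, char_edit))  # ['water', 'mark']
print(lost_tokens(original, token_edit)) # ['mark']
```

In this toy setting, one deleted character removes two of the three original tokens, while a token substitution removes exactly one. Since watermark detectors score text at the token level, a perturbation that changes more tokens per edit can disturb more of the signal for the same budget.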

Figure 3 Illustration of reference-detector-guided genetic algorithm.
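The general shape of a detector-guided genetic search can be sketched as follows. This is a simplified illustration under stated assumptions, not the algorithm from the paper: `reference_score` is a hypothetical stand-in for a reference detector (it hashes whitespace-separated words into "green" and "red"), the only perturbation is character deletion, and the selection, crossover, and mutation operators are generic choices made for this example.

```python
import hashlib
import random


def reference_score(text):
    """Hypothetical reference detector: fraction of words hashed as 'green'."""
    words = text.split()
    if not words:
        return 0.0
    green = sum(
        hashlib.sha256(w.encode("utf-8")).digest()[0] % 2 == 0 for w in words
    )
    return green / len(words)


def apply_deletions(text, positions):
    """Return text with the characters at the given indices removed."""
    drop = set(positions)
    return "".join(ch for i, ch in enumerate(text) if i not in drop)


def genetic_attack(text, budget=3, pop_size=20, generations=30, seed=0):
    """Search for `budget` deletions that minimize the reference score."""
    rng = random.Random(seed)
    # Only non-whitespace characters are candidates for deletion.
    candidates = [i for i, ch in enumerate(text) if not ch.isspace()]

    def fitness(individual):
        return reference_score(apply_deletions(text, individual))

    population = [
        tuple(sorted(rng.sample(candidates, budget))) for _ in range(pop_size)
    ]
    for _ in range(generations):
        # Selection: keep the lower-scoring half (elitism).
        population.sort(key=fitness)
        parents = population[: pop_size // 2]
        children = []
        while len(parents) + len(children) < pop_size:
            a, b = rng.sample(parents, 2)
            # Crossover: draw positions from the union of both parents.
            child = rng.sample(sorted(set(a) | set(b)), budget)
            # Mutation: occasionally move one deletion to a random position.
            if rng.random() < 0.3:
                child[rng.randrange(budget)] = rng.choice(candidates)
            child = set(child)
            while len(child) < budget:  # refill if mutation made a duplicate
                child.add(rng.choice(candidates))
            children.append(tuple(sorted(child)))
        population = parents + children
    best = min(population, key=fitness)
    return best, apply_deletions(text, best)


sample = "the quick brown fox jumps over the lazy dog near the river bank"
positions, attacked = genetic_attack(sample, budget=3)
print(reference_score(sample), reference_score(attacked))
print(attacked)
```

Because the better half of each generation is carried over unchanged, the best score found never gets worse from one generation to the next. In a realistic setting the fitness function would query an actual reference detector, and the search space would also include the other character-level perturbations mentioned above, such as typos and homoglyphs.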
