Claude-creator Anthropic has found that it's actually easier to 'poison' Large Language Models than previously thought. In a recent blog post, Anthropic explains that as few as "250 malicious documents can produce a 'backdoor' vulnerability in a large language model—regardless of model size or training data volume."
These findings arose from a joint study between Anthropic, the Alan Turing Institute, and the UK AI Security Institute. It was previously thought that bad actors would need to control a much more significant percentage of any LLM's training data to influence its behaviour, but these recent findings suggest it's actually much easier than that.
To further explain, allow me to deploy one of my characteristically unhinged metaphors. Imagine Snow White with her apple—just one bite of a piece of tainted fruit from a ne'er do well sends her into a state of torpor. Now imagine Snow White is made of server racks and a frankly eye-watering amount of memory hardware that's currently to blame for the surging prices we're seeing. Snow White is hoovering up every apple she claps eyes upon, decimating orchards of information, and even scarfing down some apples she herself, uh, regurgitated earlier—that would turn anyone's stomach.
But whereas it was previously thought the evil queen would have to somehow commandeer multiple orchards in order to poison Snow White, it turns out just one bite from a tainted apple still does the trick.
Now, before anyone starts to foster a keen interest in the twin dark arts of botany and arboriculture, Anthropic also offers some caveats for would-be LLM poisoners. The company writes, "We believe our results are somewhat less useful for attackers, who were already primarily limited not by the exact number of examples they could insert into a model’s training dataset, but by the actual process of accessing the specific data they can control for inclusion in a model’s training dataset. [...] Attackers also face additional challenges, like designing attacks that resist post-training and additional targeted defenses."
In short, this style of LLM attack is easier than first thought, but still not easy.




