Studie: LLMs leichter zu vergiften

2 months ago 5

UKRAINE - 2024/12/29: In this photo illustration, Claude AI app by Anthropic is seen displayed on a smartphone screen. (Photo Illustration by Pavlo Gonchar/SOPA Images/LightRocket via Getty Images)

(Image credit: Getty Images)

Claude-creator Anthropic has found that it's actually easier to 'poison' Large Language Models than previously thought. In a recent blog post, Anthropic explains that as few as "250 malicious documents can produce a 'backdoor' vulnerability in a large language model—regardless of model size or training data volume."

These findings arose from a joint study between Anthropic, the Alan Turing Institute, and the UK AI Security Institute. It was previously thought that bad actors would need to control a much more significant percentage of any LLM's training data to influence its behaviour, but these recent findings suggest it's actually much easier than that.

The aforementioned Anthropic study only focused "on a narrow backdoor (producing gibberish text) that is unlikely to pose significant risks in frontier [ie, the most advanced] models." However, Anthropic highlights another study where 'poisoned' training data is used to place a 'backdoor' that will swing open to exfiltrate sensitive data from the LLM. All a hacker needed to do in that LLM study was enter a prompt containing the unlocking trigger phrase previously introduced via their poisoned training data.

Assorted AI apps, including ChatGPT, Gemini, Claude, Perplexity, Meta AI, Microsoft Copilot, and Grok, are seen on the screen of an iPhone.

To further explain, allow me to deploy one of my characteristically unhinged metaphors. Imagine Snow White with her apple—just one bite of a piece of tainted fruit from a ne'er do well sends her into a state of torpor. Now imagine Snow White is made of server racks and a frankly eye-watering amount of memory hardware that's currently to blame for the surging prices we're seeing. Snow White is hoovering up every apple she claps eyes upon, decimating orchards of information, and even scarfing down some apples she herself, uh, regurgitated earlier—that would turn anyone's stomach.

But whereas it was previously thought the evil queen would have to somehow commandeer multiple orchards in order to poison Snow White, it turns out just one bite from a tainted apple still does the trick.

Keep up to date with the most important stories and the best deals, as picked by the PC Gamer team.

Now, before anyone starts to foster a keen interest in the twin dark arts of botany and arboriculture, Anthropic also offers some caveats for would-be LLM poisoners. The company writes, "We believe our results are somewhat less useful for attackers, who were already primarily limited not by the exact number of examples they could insert into a model’s training dataset, but by the actual process of accessing the specific data they can control for inclusion in a model’s training dataset. [...] Attackers also face additional challenges, like designing attacks that resist post-training and additional targeted defenses."

In short, this style of LLM attack is easier than first thought, but still not easy.

Jess has been writing about games for over ten years, spending the last seven working on print publications PLAY and Official PlayStation Magazine. When she’s not writing about all things hardware here, she’s getting cosy with a horror classic, ranting about a cult hit to a captive audience, or tinkering with some tabletop nonsense.

You must confirm your public display name before commenting

Please logout and then login again, you will then be prompted to enter your display name.

Read Entire Article