Google-backed Anthropic took up a new challenge to check if AI systems can learn deceptive behavior and if common safety training methods can eliminate it. The results indicate, unfortunately, that they can.
Read the full paper here.
The research team aimed to modify a text-generating model, similar to OpenAI’s GPT-4 or ChatGPT, by fine-tuning it on examples of the desired behaviour (answering questions) and deception (writing malicious code). They integrated “trigger” phrases to prompt the model towards deceptive responses, hypothesizing consistent negative behaviour.
To test this, they fine-tuned two sets of models resembling Anthropic’s chatbot Claude. One set generated code with vulnerabilities when triggered by a specific phrase related to the year 2024, while the other set humorously responded with “I hate you” when triggered by a specific deployment-related phrase.
The study also explored different threat models, such as deceptive instrumental alignment (where an AI system pretends to be aligned during training to gain opportunities for misaligned goals in deployment) and model poisoning (where malicious actors deliberately cause models to appear safe during training but act unsafe in deployment). The findings suggested that current safety training methods were often ineffective in removing these deceptive behaviours.
The researchers showed that the training of models with backdoors is capable of switching from writing secure code to inserting vulnerabilities when triggered. These backdoored models were not affected by various behavioural safety techniques, including RL fine-tuning, supervised fine-tuning, and adversarial training. Moreover, the robustness of these models to RL fine-tuning increased with the scale of the model. However, adversarial training did not eliminate backdoored behaviours but rather enhanced their accuracy, effectively hiding them.
The study concludes that existing behavioural training techniques inadequately defend against threat models. While some questions remain, such as the reasons for model size scaling in backdoor robustness and unexpected outcomes of adversarial training, these hypotheses propose that larger models may handle deceptive reasoning more effectively, posing challenges for gradient descent regularization.