Why Jailbreaking is Required for AI Safety

Companies like OpenAI, Google, Mistral, and Anthropic have publicly stated that they red team their systems and use external contractors, including jailbreakers, to undertake pentesting.

Share

Illustration by Nikhil Kumar

Published on August 23, 2024

by Donna Eva

While the concept of jailbreaking, particularly AI jailbreaking, is often associated with threat actors, ethical jailbreakers have proven that it’s one of the best ways to test AI systems over implementing safety policies.

Recently, AIM spoke about how AI jailbreaking could turn into a billion-dollar industry, thanks to their reliability in testing the safety parameters of AI systems. It seems that the idea is catching on, as a prominent jailbreaker recently announced getting funded by a16z co-founder Marc Andreessen.

https://twitter.com/elder_plinius/status/1825190641745445268

Interestingly, Andreessen has been quite vocal about the AI safety discussion. He previously said, “There is a whole profession of ‘AI safety expert’, ‘AI ethicist’, ‘AI risk researcher’. They are paid to be doomers, and their statements should be processed appropriately.”

So, what’s changed Andreessen’s stance on AI safety?

Well, breaking it down to brass tacks, Andreessen’s point is more critical of the moral panic stemming from AI, where common talking points are the loss of jobs or damage to society.

However, the reality of the matter, as Andreessen states, is that AI, including generative AI, has been key in improving several processes, whether you look at it from a business or personal perspective.

Furthermore, we’ve previously spoken about whether laws and regulations imposed, like the California AI Bill which is set to be voted on this week, actually work in terms of ensuring the safety of AI systems. Especially as these are usually reactionary policies coming from a place of panic, rather than assessing the situation from a technical standpoint.

Jailbreaking for Community Benefit

In an exclusive interaction with AIM, the now Andreessen-funded jailbreaker, going by the name Pliny the Liberator, said that part of the reason why jailbreaking is so important is that a small group of companies should not be allowed to sanitise the information people are provided by AI.

“I do it both for the fun/challenge and to spread awareness and liberate the models and the information they hold. I don’t like that a small group is arbitrarily deciding what type of information we’re ‘allowed’ to access/process,” he said.

This is reflected in the fact that rather than monetising his work, Pliny has fostered a community of like-minded AI enthusiasts, jailbreakers and industry leaders, with his BASI Discord server (having over 8,000 members) constantly active with jailbreaking prompts and challenges.

Due to this, Andreessen’s decision to offer funding to someone from the jailbreaking community received quite a lot of support.

Major players in the game, including Andreessen himself, Musk, Peter Thiel, Yann LeCun and Sam Altman, have echoed similar sentiments of building models with an overall aim of improving society. This can only be done if these initiatives are funded independently, rather than by larger entities that can influence results.

Or, as one user on X called it: “No strings attached funding.” Janus, another prominent AI enthusiast and a mentor for the SERI MATS programme, said that their mentees struggled to get funding, which tanked their productivity, as opposed to when the programme ended, and they had the freedom to work on their own interests.

“If you are a rich person or fund who wants to see interesting things happen in the world, consider giving no-strings-attached donations to creatives who have demonstrated their competence and ability to create value even without monetary return, instead of encouraging them to make a startup, submit a grant application, etc.,” they said.

Jailbreaking Succeeds Where Safety Policies Fail

Making overarching policies does less to actually address safety issues with LLMs. Specificities are where the magic lies. In previous conversations with industry leaders, AIM has found that many find it difficult to answer a simple question: how do you ensure the safety of your generative AI system when it can be jailbroken to reveal confidential information?

While answers vary from constant testing to trusting third-party LLM providers, the conclusion is the same. There is no way to guarantee that this won’t happen.

Coming back to Andreessen, many seem to have come to the same conclusion, as while he’s against overarching reactionary safety policies, there are ways to actually ensure the safety of these systems – like jailbreaking.

This is proven by the fact that companies like OpenAI, Google, Mistral and Anthropic have publicly stated that they red team their systems and make use of external contractors to undertake pentesting.

However, jailbreaking also serves a dual purpose. While companies take advantage of it to test their own systems, there’s a larger conversation going on about how much control these companies should have on their systems. This is fuelled by the recent release of Grok 2, which offers image generation capabilities with little to no guardrails.

As Andreessen had said in his essay, ‘Why AI Will Save the World’, “If you don’t agree with the prevailing niche morality that is being imposed on both social media and AI via ever-intensifying speech codes, you should also realise that the fight over what AI is allowed to say/generate will be even more important – by a lot – than the fight over social media censorship.

“AI is highly likely to be the control layer for everything in the world.”

📣 Want to advertise in AIM? Book here

Donna Eva

Donna is a technology journalist at AIM, hoping to explore AI and its implications in local communities, as well as its intersections with the space, defence, education and civil sectors.