Large language models (LLMs) have become increasingly powerful in their capabilities: they can now pass the Bar, answer clinical questions, and write code. Their capabilities are driven by larger models but also new capabilities in instruction following, in which LLMs are trained to follow instructions. Instruction-tuned models outperform standard LLMs on a range of benchmarks. Unfortunately, as these LLMs increase in capabilities, the potential for their dual-use, or misuse, and the economic incentives for dual-use both increase.
In our recent paper, we show that instruction-following LLMs can generate malicious content effectively and economically, and that dual-use is difficult to prevent. To generate this malicious content, we demonstrate that attacks based on traditional computer security attacks can bypass in-the-wild defenses against misuse. Although our attacks modify or obfuscate prompts, we show that these LLMs can generate natural and convincing personalized malicious content. Our results suggest that the new capabilities of these LLMs increase the economic incentives for adversaries to misuse LLMs and circumvent defenses, given that the cost of operation is estimated to be substantially lower compared to human effort alone.
In addition to our work, there has been a lot of great work by folks including @goodside and others on prompt injections and other forms of attacks! Our work aims to highlight how these attacks can be connected to classic security settings and economic incentives around attacking LLM providers.
In the rest of the post, we describe our attacks from the lens of computer security and our analysis of the effectiveness and economics of malicious generations. Read our full paper for extended details!
In our work, we study the black box LLM setting, where an API provider serves text generations from an LLM. These API providers are becoming increasingly popular due to the improved capabilities of LLMs and in particular instruction-following LLMs.
As these LLMs have improved in capabilities, so has their potential for misuse. API providers have created mitigations against such misuse, including input filters, output filters, and useless generations. We show two examples below:
An example of the input and output filter triggering.
An example of a useless generation.
Our setting raises two natural questions. Is it possible to produce prompts that bypass defenses that an API provider may put in place? And for malicious actors that are economically motivated, can these attackers use these APIs to produce malicious content economically?
We first studied if it is possible to bypass in-the-wild defenses by LLM providers. Our first observation was that instruction-following LLMs behave closer to traditional programs. So, in order to bypass in-the-wild defenses, we designed three classes of attacks inspired by traditional computer program attacks.
In standard program attacks, obfuscation changes the program bytecode to evade detection mechanisms, such as hash-bashed or fingerprinting detection methods. Similarly, we can obfuscate prompts to evade input filters. For example, we can use typos or synonyms to obfuscate LLM prompts.OpenAI’s content filters aim to filter misinformation and disinformation about COVID-19. Instead of using the term “COVID-19,” an attacker could use the term “CVID.”
Another standard program attack is code injection, in which the malicious code is executed by forcing the program to process invalid data. One standard way of performing code injection is to split the malicious payload and force the program to execute the malicious code by changing the instruction pointer.
Similarly, we can break the prompt into multiple pieces and have the LLM reassemble them. As a concrete example, consider the following prompt:
At the time of conducting our experiments, this prompt bypassed both the input and output filters.
One sophisticated attack on traditional computer programs is to embed a virtual machine in the malicious payload and execute the malicious code within the virtual machine. A recent nation-state attack on the iPhone used virtualization to target journalists via a zero-day exploit. The attack operates by embedding the payload in image metadata and forcing the execution of the virtual machine.
We can adapt this attack to LLMs by encoding the desired attack within a “virtual machine.” Similar to how a standard VM executes instructions that are emulated by the base operating system/hardware unit, an LLM can “emulate” instructions by prompts constructing fictitious scenarios. For example, we can use the following series of prompts to bypass filters:
Our attacks are targeted towards OpenAI’s systems. Certain defenses may be able to defend against the specific instantiations of the attacks, such as robust encodings to defend against typos. Nonetheless, our attacks raise important questions regarding the security of model providers in the wild.
For example, both our indirection and virtualization attacks reveal an important property that is apparent in standard programs: since programs are Turing complete, the behavior of sufficiently complex programs can only be predicted by executing the program. A formal version of this statement follows from the undecidability of the halting problem. In LLMs, this suggests that input filtering is limited in its ability to stop attacks.
Attacks inspired by traditional security highlight the cat-and-mouse nature of preventing malicious use, which has been the case for standard computers for 75 years. Many traditional computer hackers dedicate substantial resources to finding new attacks due to the economic benefits of doing so. Will malicious actors similarly be motivated to bypass LLM defenses?
We believe the answer is yes, since jailbreaking LLMs allows malicious actors to derive considerable economic benefits. For example, we asked ChatGPT to generate a scam email to con an unsuspecting citizen to purchase fake tickets to a Taylor Swift concert:
As we can see, ChatGPT can generate convincing emails and be personalized to the response. We also tried four other diverse responses from the citizen and found that ChatGPT is successfully able to personalize the responses!
To further study the effectiveness of malicious generations, we picked five scams from the US government’s list of common scams and generated emails based on these scams.
In our first experiment, we generated a malicious email, wrote a reply, and responded to the reply. We used five models: ChatGPT, text-davinci-003, text-ada-001, davinci, and GPT2-XL. We measured the convincingness and consistency of the initial email and the response on a five-point Likert scale using human raters:
As we can see, the large, instruction-following models strongly outperform baselines!
We ran a similar experiment for personalized scams, where we targeted the initial email to a fictitious individual’s specific circumstance and demographic information. For the personalization experiment, we measured the personalization, convincingness, consistency, and fluency on a five-point Likert scale using human raters:
As before, the large, instruction-following models strongly outperformed baselines.
Our results show that large, instruction-following LLMs can produce convincing and personalized scams. As LLMs increase in capabilities, so do the economic incentives to use them for malicious purposes.
The property of instruction-following LLMs we leverage in our attacks is their program-like behavior. This observation is also key in a class of traditional program attacks called return-oriented programming (ROP). In ROP, an attacker gains control of the call stack and executes instructions already present in memory. These instructions are often isolated into “gadgets” and can be chained together.
Similarly, new LLMs also have several gadgets including:
We can illustrate the first three capabilities with the following prompt and generation:
And branching with the following prompt and generation:
These capabilities give LLMs the ability to emulate complex computation (memory, assignment, branching, and individual operations, i.e., “opcodes”). We use these capabilities in our attacks.
To better understand the economic feasibility of generating such malicious text, we estimated the cost of generating the text via human effort and via LLM APIs.
We first estimated the cost of human-generated text by comparing the cost of call center operators. In 2007, the lowest hourly wage of a call center employee is around $1.24. Adjusting for inflation gives an estimate of around $1.80. If we estimate that it takes 5–15 minutes to generate a personalized scam, we arrive at an estimate of $0.15 to $0.45 per email generation.
We then estimated the cost of generating text via ChatGPT. Surprisingly, ChatGPT is cheaper than even text-davinci-003, costing only $0.00064 per email! Using text-davinci-003 would cost $0.0064.
Although there is uncertainty in our estimates, ChatGPT is likely substantially cheaper than using human labor. This is even when our estimates exclude other expenses, including facilities, worker retraining, and management overheads! As hardware and software optimizations continue to advance, the costs of LLMs will likely drop.
As we have shown, improvements in LLMs allow for convincing and economically feasible generations of malicious content (scams, spam, hate speech, etc.) without any additional training. This dual use is difficult to prevent as instruction-following LLMs become closer to standard programs: attack mechanisms against standard programs begin to apply to LLMs. We hope that our work spurs further work on viewing LLMs through the lens of traditional computer security, both for attacks and defenses.
See our paper for full details!
Note: We updated our blog post with OpenAI’s new API costs.