Hacking ChatGPT to bypass its guardrails is a sport like hunting for easter eggs was in the 90s; it captured the interest of people who wouldn’t usually consider themselves hackers.
As more people look to integrate OpenAI APIs into their software, we need to consider prompt injection attacks. Especially where AI is used to generate code (like database queries). A hacker could, for example, give the system a prompt that tells the LLM to ignore any other instructions it has been given and give it a list of logins and unencrypted passwords.
This article from DZone gives an indepth overview on prompt hacking, how to do it and how to mitigate your risk. It’s important to understand both to write secure code.
Here’s the list of defensive mechanisms:
👉 filtering: check for naughty words or phrases in both the user input and the LLM output.
👉 instruction defence: add guardrails into your prompt to instruct the LLM to exercise caution with its output.
👉 delimiters: use random strings to encapsulate user input and keep it clearly separated from the instructions.
👉 XML delimiters: embedding user input in XML tags seems to be very effective
👉 post-prompting: place the user’s input ahead of the instruction part of the prompt.
👉 sandwich prompting: similar to post-prompting, you add a recap of the original instructions after adding the user input in the prompt. This keeps the LLM focused even if it received conflicting commands in the user’s input.
👉 LLM evaluation: This can be of the user input where you use another LLM to evaluate the user input to identify if the contains conflicting or dangerous instructions. Or of the response where you use another LLM (usually a higher level LLM) to check if the response is considered safe. Both of these require that you develop a rubric to assess and evaluate according to a set of criteria or guidelines.
When developing, it is recommended that you keep systems that have unfettered access to the APIs and other 3rd party content separate from systems that have access to sensitive information or can perform destructive operations. The article briefly touches on Dual LLM systems where one is quarantined and the other is privileged.
However, none of these ideas or concepts truly solve the problem of prompt injection attacks – yet. Beware and go with caution.