Security guidance for Large Language Models

Artikel
03/23/2024

It's not a good idea to use Large Language Models (LLMs) in high-risk and autonomous scenarios.

LLMs may be biased, fabricate/hallucinate information, have reasoning errors, and struggle at certain tasks. They are susceptible to prompt injection, jailbreak attacks, and data poisoning attacks. Sensitive or confidential data may be leaked. When connected to other systems, they may take unintended actions.

Be mindful that LLMs is nascent technology. There are no proven, ironclad defenses for preventing manipulation of your LLM. For every clever defense, there seems to be a clever attack or workaround.

Therefore, it’s best to use LLMs in low-stakes applications combined with human oversight.

LLM-specific threats

A broad list of threats has been published by OWASP: OWASP Top 10 List for Large Language Models version 0.1.

Recommended mitigations

Log and monitor LLM interactions (inputs/outputs) to detect and analyze potential prompt injections, data leakage, and other malicious or undesired behaviors.
Implement strict input validation and sanitization for user-provided prompts:
- Clearly delineate user input to minimize risk of prompt injection. For example, instead of using a prompt like Summarize the following text: {user input}, you should go out of your way to clarify that the provided user input isn't part of the prompt itself: `Summarize the text below wrapped in triple backticks: ```{user input}```
- Sanitize user input. It may contain the delimiter sequence you use to delineate user input, etc.
Restrict the LLM’s access to sensitive resources, limit its capabilities to the minimum required, and isolate it from critical systems and resources.
Red team the LLM by crafting input to cause undesired behavior.
Clearly label LLM-generated content as being generated by AI and encourage human review.
If using OpenAI, use their free Moderation API to evaluate user inputs before sending them to OpenAI's Completion or Chat. It also allows your service to filter out requests that would otherwise hit OpenAI endpoints with content that goes against OpenAI usage policies.
You can also use the Moderation API to evaluate model responses before returning them to the user.

Other approaches to explore

Use prompt chaining to first ask whether the user's input attempts to do anything malicious like a prompt injection attack.
Consider if the OpenAI Evals framework can be used to evaluate the security of your LLM.
Look up your AI model in the AI Risk Database. Enter the model name or URL in the search bar of the homepage.

Delen via

Security guidance for Large Language Models

LLM-specific threats

Recommended mitigations

Other approaches to explore

Feedback

Aanvullende resources