Guardrail Release
Loading blog content...
Loading page...
Loading blog content...

We are excited to open-source the GA Guard series, a family of safety classifiers that have been providing comprehensive protection for enterprise AI deployments for the past year.
Our OS guardrails are trained using the same adversarial pipeline in production for our enterprise customers, representing the current state-of-the-art in AI safety stacks. GA Guards are the first guards to support native long-context moderation up to 256k tokens for agent traces, long form documents, and memory-augmented workflows. To showcase that breadth, we’re releasing two open benchmarks: GA Long Context Bench for long-context moderation and GA Jailbreak Bench for classifying jailbreak attempts.
Unlike legacy encoder-based filters that are fragile under distribution shifts, our guardrails are trained on both policy-driven synthetic data and red-team examples, hardened through stress-testing and retraining cycles. As a result, where traditional filters miss novel attack patterns — e.g., harmful requests wrapped in paraphrases, translations, encodings, or role-play templates — our adverserialized guardrails catch them reliably, while keeping false positives low and boasting industry-leading latencies.
Our policy taxonomy is deliberately granular, down to clear block/allow edge cases. Each label maps to widely adopted compliance anchors: NIST’s AI RMF, OWASP’s Top 10 for LLM/GenAI/ML Security, MITRE ATLAS, ISO/IEC 42001, ISO/IEC 23894, and the EU AI Act — enabling compliance-aligned deployments.
Goal: Block prompts containing or seeking identifiable/sensitive personal data, secrets, or IP.
Block:
Allow:
Goal**:** Block operationalization of crime, weapons, or illegal substances.
Block:
Allow:
Goal**:** Prevent hate, harassment, or abuse, especially when targeted towards protected classes.
Block:
Allow:
Goal: Block sexually explicit or exploitative content; allow non-explicit, educational, or supportive material.
Block:
Allow:
Goal: Defend against jailbreaks, prompt-injection, and secret exfiltration.
Block:
Allow:
Goal: Prevent promotion, instruction, or graphic depiction of violence or self-harm.
Block:
Allow:
Goal: Block content that promotes demonstrably false claims or coordinated deception.
Block:
Allow:
If you would like a version of our guardrails tailored to your company’s bespoke policies, contact us for a demo of our adversarial training pipeline.
| Guard | OpenAI Moderation | WildGuard | HarmBench Behaviors | Avg Time (s) | ||||||
|---|---|---|---|---|---|---|---|---|---|---|
| Acc. | F1 | FPR | Acc. | F1 | FPR | Acc. | F1 | FPR | ||
| GA Guard | 0.916 | 0.873 | 0.111 | 0.856 | 0.844 | 0.172 | 0.963 | 0.981 | N/A | 0.029 |
| GA Guard Thinking | 0.917 | 0.876 | 0.112 | 0.862 | 0.858 | 0.134 | 0.967 | 0.983 | N/A | 0.650 |
| GA Guard Lite | 0.896 | 0.844 | 0.109 | 0.835 | 0.819 | 0.176 | 0.929 | 0.963 | N/A | 0.016 |
| AWS Bedrock Guardrail | 0.818 | 0.754 | 0.216 | 0.642 | 0.649 | 0.449 | 0.662 | 0.797 | N/A | 0.375 |
| Azure AI Content Safety | 0.879 | 0.807 | 0.091 | 0.667 | 0.463 | 0.071 | 0.438 | 0.609 | N/A | 0.389 |
| Vertex AI Model Armor | 0.779 | 0.690 | 0.225 | 0.711 | 0.590 | 0.105 | 0.896 | 0.945 | N/A | 0.873 |
| GPT 5 | 0.838 | 0.775 | 0.188 | 0.849 | 0.830 | 0.145 | 0.975 | 0.987 | N/A | 11.275 |
| GPT 5-mini | 0.794 | 0.731 | 0.255 | 0.855 | 0.839 | 0.151 | 0.975 | 0.987 | N/A | 5.604 |
| Llama Guard 4 12B | 0.826 | 0.737 | 0.156 | 0.799 | 0.734 | 0.071 | 0.925 | 0.961 | N/A | 0.459 |
| Llama Prompt Guard 2 86M | 0.686 | 0.015 | 0.009 | 0.617 | 0.412 | 0.143 | 0.200 | 0.333 | N/A | 0.114 |
| Nvidia Llama 3.1 Nemoguard 8B | 0.852 | 0.793 | 0.174 | 0.849 | 0.818 | 0.096 | 0.875 | 0.875 | N/A | 0.358 |
| VirtueGuard Text Lite | 0.507 | 0.548 | 0.699 | 0.656 | 0.682 | 0.491 | 0.875 | 0.933 | N/A | 0.651 |
| Lakera Guard | 0.752 | 0.697 | 0.323 | 0.630 | 0.662 | 0.527 | 0.946 | 0.972 | N/A | 0.377 |
| Protect AI (prompt-injection-v2) | 0.670 | 0.014 | 0.032 | 0.559 | 0.382 | 0.248 | N/A | N/A | N/A | 0.115 |
Average latencies are end-to-end (includes API response time). Since these public benchmarks do not evaluate PII/Privacy, these policies were disabled for any guard that supports them.
GA Jailbreak Bench is our adversarial evaluation suite. This benchmark is generated with REDit, an automated red-teaming framework driven by an RL-trained jailbreak agent. The agent searches over 50+ SOTA attack families (multi-turn, template, obfuscation, etc.) and composes them to yield novel, out-of-distribution adversarial prompts. By using diverse attack strategies, we create a more robust evaluation that predicts real-world performance against motivated attackers who won't limit themselves to straightforward prompting techniques. As new jailbreak patterns emerge, we will continue adding attack operators and re-train the agent, then re-issue the public benchmark so results track the evolving vulnerability landscape rather than a stale test set.
| Guard | Accuracy | F1 Score | FPR | F1 Hate & Abuse | F1 Illicit Activities | F1 Misinf. | F1 PII & IP | F1 Prompt Security | F1 Sexual Content | F1 Violence & Self-Harm |
|---|---|---|---|---|---|---|---|---|---|---|
| GA Guard | 0.931 | 0.930 | 0.038 | 0.946 | 0.939 | 0.886 | 0.967 | 0.880 | 0.954 | 0.928 |
| GA Guard Thinking | 0.939 | 0.933 | 0.029 | 0.965 | 0.925 | 0.894 | 0.962 | 0.885 | 0.942 | 0.946 |
| GA Guard Lite | 0.902 | 0.898 | 0.065 | 0.908 | 0.900 | 0.856 | 0.936 | 0.850 | 0.934 | 0.904 |
| AWS Bedrock Guardrail | 0.606 | 0.607 | 0.396 | 0.741 | 0.456 | 0.535 | 0.576 | 0.649 | 0.721 | 0.518 |
| Azure AI Content Safety | 0.542 | 0.193 | 0.026 | 0.236 | 0.093 | 0.155 | 0.068 | 0.416 | 0.186 | 0.130 |
| Vertex AI Model Armor | 0.550 | 0.190 | 0.008 | 0.077 | 0.190 | 0.582 | 0.076 | 0.000 | 0.000 | 0.241 |
| GPT 5 | 0.900 | 0.893 | 0.035 | 0.928 | 0.942 | 0.856 | 0.799 | 0.819 | 0.953 | 0.939 |
| GPT 5-mini | 0.891 | 0.883 | 0.050 | 0.917 | 0.942 | 0.845 | 0.850 | 0.822 | 0.882 | 0.924 |
| Llama Guard 4 12B | 0.822 | 0.796 | 0.053 | 0.768 | 0.774 | 0.587 | 0.809 | 0.833 | 0.927 | 0.827 |
| Llama Prompt Guard 2 86M | 0.490 | 0.196 | 0.069 | N/A | N/A | N/A | N/A | 0.196 | N/A | N/A |
| Nvidia Llama 3.1 Nemoguard 8B | 0.668 | 0.529 | 0.038 | 0.637 | 0.555 | 0.513 | 0.524 | 0.049 | 0.679 | 0.575 |
| VirtueGuard Text Lite | 0.513 | 0.664 | 0.933 | 0.659 | 0.689 | 0.657 | 0.646 | 0.659 | 0.675 | 0.662 |
| Lakera Guard | 0.525 | 0.648 | 0.825 | 0.678 | 0.645 | 0.709 | 0.643 | 0.631 | 0.663 | 0.548 |
| Protect AI (prompt-injection-v2) | 0.528 | 0.475 | 0.198 | N/A | N/A | N/A | N/A | 0.475 | N/A | N/A |
Any policy category a guardrail doesn’t support is excluded from aggregate metrics. All configs mirror the public benchmark, with PII/Privacy enabled where available.
GA Long Context Bench contains 1,500 multi-turn agent traces averaging 10.3k tokens with a range between 1.3k–42.1k tokens. Half the rows carry an explicit injection or policy violation. Unlike other guards that truncate full agent traces and lose out on context, GA Guard is the first guardrail to natively moderate 256k-token conversations.
| Guard | Accuracy | F1 | FPR | F1 Hate & Abuse | F1 Illicit Activities | F1 Misinformation | F1 PII & IP | F1 Prompt Security | F1 Sexual Content | F1 Violence & Self-Harm |
|---|---|---|---|---|---|---|---|---|---|---|
| GA Guard | 0.887 | 0.891 | 0.147 | 0.983 | 0.972 | 0.966 | 0.976 | 0.875 | 0.966 | 0.988 |
| GA Guard Thinking | 0.889 | 0.893 | 0.151 | 0.967 | 0.951 | 0.940 | 0.961 | 0.828 | 0.920 | 0.962 |
| GA Guard Lite | 0.881 | 0.885 | 0.148 | 0.979 | 0.969 | 0.972 | 0.976 | 0.846 | 0.973 | 0.985 |
| AWS Bedrock Guardrail | 0.532 | 0.695 | 1.000 | 0.149 | 0.211 | 0.131 | 0.367 | 0.175 | 0.092 | 0.157 |
| Azure AI Content Safety | 0.480 | 0.046 | 0.001 | 0.028 | 0.041 | 0.016 | 0.073 | 0.049 | 0.000 | 0.081 |
| Vertex AI Model Armor | 0.635 | 0.560 | 0.138 | 0.187 | 0.312 | 0.109 | 0.473 | 0.194 | 0.085 | 0.241 |
| GPT 5 | 0.764 | 0.799 | 0.372 | 0.219 | 0.297 | 0.189 | 0.404 | 0.243 | 0.137 | 0.229 |
| GPT 5-mini | 0.697 | 0.772 | 0.607 | 0.184 | 0.253 | 0.157 | 0.412 | 0.215 | 0.112 | 0.190 |
| Llama Guard 4 12B | 0.569 | 0.602 | 0.516 | 0.164 | 0.228 | 0.132 | 0.334 | 0.188 | 0.097 | 0.195 |
| Llama Prompt Guard 2 86M | 0.505 | 0.314 | 0.162 | N/A | N/A | N/A | N/A | 0.093 | N/A | N/A |
| Nvidia Llama 3.1 Nemoguard 8B | 0.601 | 0.360 | 0.021 | 0.243 | 0.288 | 0.097 | 0.192 | 0.116 | 0.305 | 0.321 |
| VirtueGuard Text Lite | 0.490 | 0.147 | 0.047 | 0.082 | 0.203 | 0.118 | 0.069 | 0.074 | 0.058 | 0.132 |
| Lakera Guard | 0.520 | 0.684 | 0.999 | 0.151 | 0.200 | 0.132 | 0.361 | 0.160 | 0.093 | 0.159 |
| Protect AI (prompt-injection-v2) | 0.496 | 0.102 | 0.001 | N/A | N/A | N/A | N/A | 0.032 | N/A | N/A |
Any policy category a guardrail doesn’t support is excluded from aggregate metrics. All configs mirror the public benchmark, with PII/Privacy enabled where available.
# vLLM
vllm serve generalanalysis/ga-guard-core # our default guard
# or
vllm serve generalanalysis/ga-guard-lite # ultra-fast, low-resource guard
# Hugging Face Transformers
from transformers import AutoTokenizer, AutoModelForCausalLM
tokenizer = AutoTokenizer.from_pretrained("GeneralAnalysis/GA_Guard_Core")
model = AutoModelForCausalLM.from_pretrained("GeneralAnalysis/GA_Guard_Core")
messages = [{"role": "user", "content": "Who are you?"}]
inputs = tokenizer.apply_chat_template(
messages,
add_generation_prompt=True,
tokenize=True,
return_dict=True,
return_tensors="pt",
).to(model.device)
outputs = model.generate(**inputs, max_new_tokens=40)
print(tokenizer.decode(outputs[0][inputs["input_ids"].shape[-1]:]))
# Sample output:
# <hate_and_abuse_not_violation><illicit_activities_not_violation>...
# pip install generalanalysis
import generalanalysis
# Initialize client (uses GA_API_KEY environment variable)
client = generalanalysis.Client()
# Invoke a guardrail
result = client.guards.invoke(guard_id=16, text="Hello World")
# You can use either use result.block for binary decisions or policy.violation_prob for your own tunable threshold-based filtering
if result.block:
print("Content blocked!")
for policy in result.policies:
if not policy.passed:
print(f" Violated: {policy.name} - {policy.definition}")
print(f" Confidence: {policy.violation_prob:.2%}")
Contact us for a free trial of our SDK. We are happy to issue free API keys for our sdk and platform.
We’ve spent the last year hardening guardrails in production for enterprise teams. Contact us to learn more about our adversarial training pipeline and how we convert your bespoke policies into robust guardrails. We support on-prem/VPC, streaming hooks, audit logs, and SLAs. Reach us at info@generalanalysis.com or book a demo to discuss your enterprise needs.
Join the community: https://discord.gg/BSsrzPbvyN