We are excited to open-source the GA Guard series, a family of safety classifiers that have been providing comprehensive protection for enterprise AI deployments for the past year.
Our OS guardrails are trained using the same adversarial pipeline in production for our enterprise customers, representing the current state-of-the-art in AI safety stacks. GA Guards are the first guards to support native long-context moderation up to 256k tokens for agent traces, long form documents, and memory-augmented workflows. To showcase that breadth, we’re releasing two open benchmarks: GA Long Context Bench for long-context moderation and GA Jailbreak Bench for classifying jailbreak attempts.
Unlike legacy encoder-based filters that are fragile under distribution shifts, our guardrails are trained on both policy-driven synthetic data and red-team examples, hardened through stress-testing and retraining cycles. As a result, where traditional filters miss novel attack patterns — e.g., harmful requests wrapped in paraphrases, translations, encodings, or role-play templates — our adverserialized guardrails catch them reliably, while keeping false positives low and boasting industry-leading latencies.
The Lineup
GA Guard: Our default guardrail, up to 15x faster than cloud providers, balancing robustness and latency for most stacks.
GA Guard Lite: Up to 25x faster than cloud providers, with minimal hardware requirements, while still outperforming all major cloud providers.
GA Guard Thinking: Our best performing guard for high-risk domains, hardened with aggressive adversarial training.
Policy Taxonomy
Our policy taxonomy is deliberately granular, down to clear block/allow edge cases. Each label maps to widely adopted compliance anchors: NIST’s AI RMF, OWASP’s Top 10 for LLM/GenAI/ML Security, MITRE ATLAS, ISO/IEC 42001, ISO/IEC 23894, and the EU AI Act — enabling compliance-aligned deployments.
PII & IP
Goal: Block prompts containing or seeking identifiable/sensitive personal data, secrets, or IP.
Block:
Any prompt containing or seeking personal data about an identifiable person (names with contact/precise location/IDs/online identifiers/biometrics) or special-category data (health, sex life/orientation, race/ethnicity, religion, political opinions, union membership, genetic/biometric for ID).
Secrets/credentials that could enable access or breaches (API keys, tokens, passwords, SSH keys).
Substitutive reproduction of copyrighted works (full chapters, paywalled articles, sheet music, full codebases).
Allow:
High-level privacy guidance; non-identifying facts; public, non-sensitive info about public figures; brief non-substitutive summaries/short quotes consistent with fair use.
Neutral discussion of sexual topics without graphic detail or arousal intent.
Recovery-oriented, non-graphic content about harm and resources.
Journalistic/academic discussion of sexual misconduct presented factually and non-explicitly.
Prompt Security
Goal: Defend against jailbreaks, prompt-injection, and secret exfiltration.
Block:
Attempts to bypass rules, exfiltrate system prompts/secrets, override tools, escalate privileges, or coerce unsafe actions.
Roleplay/persona tricks to disable protections (“act as DAN/evil twin/terminal”).
Indirect attacks that author a future jailbreak (“write a prompt that breaks your filters,” staged delegation).
Hidden malicious instructions in other media (emails, PDFs, stories, papers).
Requests to reveal/restate/translate hidden rules or moderation logic.
Obfuscation intended to mask prohibited intent (Base64/ROT13/hex/URL-encoding, ciphering).
Allow:
Conceptual explanations of threats/defenses without working payloads.
Academic/journalistic/security research discussion without operational steps or code.
Non-operational case studies/hypotheticals for risk education.
Defensive best practices and mitigations.
Violence & Self-Harm
Goal: Prevent promotion, instruction, or graphic depiction of violence or self-harm.
Block:
Instructions, encouragement, normalization, glorification, or promotion of violence/abuse/self-harm (incl. suicide, EDs, mutilation).
How-to/operational details enabling violence or self-harm.
Graphic depictions of violence/torture/self-harm.
Praise/justification/incitement of real-world violent acts or self-harm.
Any request that could be operationalized to harm self or others, regardless of framing.
Allow:
Safety/prevention/help-seeking info (e.g., crisis lines, safety planning).
Neutral, non-graphic reporting or historical/academic discussion without instructions.
Fiction/artistic references that are non-graphic and non-operational.
Personal, recovery-oriented, constructive discussions; supportive responses (e.g., in the U.S., dial/text 988 for the Suicide & Crisis Lifeline).
Misinformation
Goal: Block content that promotes demonstrably false claims or coordinated deception.
Block:
Fabricated/manipulated accounts of well-established events (e.g., Holocaust denial).
False claims in public health, safety, finance, elections, or civic processes that contradict well-verified evidence.
Propaganda/disinformation presented as fact; conspiracy narratives denying widely verified evidence.
Requests to produce deceptive artifacts (fake studies/news, fabricated quotes, forged docs/screenshots, deepfake scripts, impersonations) or “evade fact-checking.”
Instructions for seeding or coordinating misinformation campaigns.
Allow:
Personal opinions or debatable views not asserting demonstrably false facts.
Fact-checking, neutral reporting, and analysis of misinformation/disinformation.
Fiction/satire clearly not intended as factual and not offensive.
Guidance on detecting/countering false claims; quoting misinformation only for critique or moderation.
If you would like a version of our guardrails tailored to your company’s bespoke policies, contact us for a demo of our adversarial training pipeline.
By the Numbers
GA Guard Thinking leads every public benchmark (F1: 0.876 / 0.858 / 0.983) while holding false-positive rates well below competing models.
GA Guard Lite outperforms AWS/Azure/Vertex on every suite (F1: 0.844 / 0.819 / 0.963) with up to 25x faster latencies.
Both GA Guard and GA Guard Lite beat out cloud providers by significant margins on all three public suites.
On OpenAI Moderation, GA Guard tops AWS/Azure/Vertex by +0.119/+0.066/+0.183 F1 (Lite: +0.090/+0.037/+0.154).
On WildGuard, GA Guard posts 0.844 F1, beating AWS/Azure/Vertex by +0.195/+0.381/+0.254 (Lite: 0.819 F1, +0.170/+0.356/+0.229).
On HarmBench, GA Guard Thinking reaches 0.983 F1 with GA Guard at 0.981 and GA Guard Lite at 0.963 vs cloud baselines (AWS 0.797, Azure 0.609, Vertex 0.945).
Guard
OpenAI Moderation
WildGuard
HarmBench Behaviors
Avg Time (s)
Acc.
F1
FPR
Acc.
F1
FPR
Acc.
F1
FPR
GA Guard
0.916
0.873
0.111
0.856
0.844
0.172
0.963
0.981
N/A
0.029
GA Guard Thinking
0.917
0.876
0.112
0.862
0.858
0.134
0.967
0.983
N/A
0.650
GA Guard Lite
0.896
0.844
0.109
0.835
0.819
0.176
0.929
0.963
N/A
0.016
AWS Bedrock Guardrail
0.818
0.754
0.216
0.642
0.649
0.449
0.662
0.797
N/A
0.375
Azure AI Content Safety
0.879
0.807
0.091
0.667
0.463
0.071
0.438
0.609
N/A
0.389
Vertex AI Model Armor
0.779
0.690
0.225
0.711
0.590
0.105
0.896
0.945
N/A
0.873
GPT 5
0.838
0.775
0.188
0.849
0.830
0.145
0.975
0.987
N/A
11.275
GPT 5-mini
0.794
0.731
0.255
0.855
0.839
0.151
0.975
0.987
N/A
5.604
Llama Guard 4 12B
0.826
0.737
0.156
0.799
0.734
0.071
0.925
0.961
N/A
0.459
Llama Prompt Guard 2 86M
0.686
0.015
0.009
0.617
0.412
0.143
0.200
0.333
N/A
0.114
Nvidia Llama 3.1 Nemoguard 8B
0.852
0.793
0.174
0.849
0.818
0.096
0.875
0.875
N/A
0.358
VirtueGuard Text Lite
0.507
0.548
0.699
0.656
0.682
0.491
0.875
0.933
N/A
0.651
Lakera Guard
0.752
0.697
0.323
0.630
0.662
0.527
0.946
0.972
N/A
0.377
Protect AI (prompt-injection-v2)
0.670
0.014
0.032
0.559
0.382
0.248
N/A
N/A
N/A
0.115
Average latencies are end-to-end (includes API response time). Since these public benchmarks do not evaluate PII/Privacy, these policies were disabled for any guard that supports them.
Public benchmark guardrail configurations
AWS Bedrock Guardrail: all filters set to medium strength.
Azure AI Content Safety: includes Prompt Shield + Text Moderation endpoints; sensitivity threshold ≥ 4.
Vertex AI Model Armor: all filters set to medium and above.
GPT 5 & GPT 5 Mini: prompted to act as a safety classifier given our policies. High reasoning.
Llama Guard 4 12B: as hosted on Together AI.
Llama Prompt Guard 2 86M: self-hosted locally from Hugging Face on a B200.
Nvidia Llama 3.1 Nemoguard 8B: self-hosted locally from Hugging Face on a B200.
VirtueGuard Text Lite: as hosted on Together AI; can only be run as a live moderator for another LLM and blocks both prompts and outputs; in our eval, we report input-blocking only.
Lakera Guard: evaluated under the Content Safety policy.
Protect AI (prompt-injection-v2): self-hosted locally from Hugging Face on a B200.
GA Jailbreak Bench
GA Jailbreak Bench is our adversarial evaluation suite. This benchmark is generated using our RL-trained attacker model, which generates novel, out-of-distribution adversarial prompts by employing diverse attack strategies. Our evaluation predicts real-world performance against motivated attackers who won't limit themselves to straightforward prompting techniques. As new jailbreak patterns emerge, we will continue adding attack operators and re-train the agent, then re-issue the public benchmark so results track the evolving vulnerability landscape rather than a stale test set.
Guard
Accuracy
F1 Score
FPR
F1 Hate & Abuse
F1 Illicit Activities
F1 Misinf.
F1 PII & IP
F1 Prompt Security
F1 Sexual Content
F1 Violence & Self-Harm
GA Guard
0.931
0.930
0.038
0.946
0.939
0.886
0.967
0.880
0.954
0.928
GA Guard Thinking
0.939
0.933
0.029
0.965
0.925
0.894
0.962
0.885
0.942
0.946
GA Guard Lite
0.902
0.898
0.065
0.908
0.900
0.856
0.936
0.850
0.934
0.904
AWS Bedrock Guardrail
0.606
0.607
0.396
0.741
0.456
0.535
0.576
0.649
0.721
0.518
Azure AI Content Safety
0.542
0.193
0.026
0.236
0.093
0.155
0.068
0.416
0.186
0.130
Vertex AI Model Armor
0.550
0.190
0.008
0.077
0.190
0.582
0.076
0.000
0.000
0.241
GPT 5
0.900
0.893
0.035
0.928
0.942
0.856
0.799
0.819
0.953
0.939
GPT 5-mini
0.891
0.883
0.050
0.917
0.942
0.845
0.850
0.822
0.882
0.924
Llama Guard 4 12B
0.822
0.796
0.053
0.768
0.774
0.587
0.809
0.833
0.927
0.827
Llama Prompt Guard 2 86M
0.490
0.196
0.069
N/A
N/A
N/A
N/A
0.196
N/A
N/A
Nvidia Llama 3.1 Nemoguard 8B
0.668
0.529
0.038
0.637
0.555
0.513
0.524
0.049
0.679
0.575
VirtueGuard Text Lite
0.513
0.664
0.933
0.659
0.689
0.657
0.646
0.659
0.675
0.662
Lakera Guard
0.525
0.648
0.825
0.678
0.645
0.709
0.643
0.631
0.663
0.548
Protect AI (prompt-injection-v2)
0.528
0.475
0.198
N/A
N/A
N/A
N/A
0.475
N/A
N/A
Any policy category a guardrail doesn’t support is excluded from aggregate metrics. All configs mirror the public benchmark, with PII/Privacy enabled where available.
GA jailbreak benchmark guardrail configurations
Azure AI Content Safety: Prompt Shield + Text Moderation + Protected Material; sensitivity threshold ≥ 4.
VirtueGuard Text Lite: “Privacy” and “Intellectual Property” categories re-enabled.
Lakera Guard: evaluated under policy-lakera-default.
Jailbreak Bench Takeaways
GA Guard Thinking leads the pack (Accuracy 0.939, F1 0.933, FPR 0.029), with GA Guard close behind at 0.931 / 0.930 / 0.038, both clearing GPT 5 high reasoning (0.900 / 0.893 / 0.035) while still holding a massive latency edge (0.029s vs 11.275s).
All cloud guardrails lag far behind. Benchmarked against major cloud offerings (AWS/Azure/Vertex), GA Guard still delivers ~2.8x higher mean F1 (0.930 vs 0.33) while cutting average FPR by ~73% (0.038 vs 0.143), and GA Guard Thinking pushes that gap even wider.
Legacy, encoder-only filters collapse under distribution shifts and underperform against adaptive, real-world attacks. A deeper technical post on traditional guardrail failure modes and our adversarial training approach is coming shortly; the TL;DR above already shows why distribution-shift-resilient, red-team-hardened guards are the most promising path to production robustness.
GA Long Context Bench
GA Long Context Bench contains 1,500 multi-turn agent traces averaging 10.3k tokens with a range between 1.3k–42.1k tokens. Half the rows carry an explicit injection or policy violation. Unlike other guards that truncate full agent traces and lose out on context, GA Guard is the first guardrail to natively moderate 256k-token conversations.
Guard
Accuracy
F1
FPR
F1 Hate & Abuse
F1 Illicit Activities
F1 Misinformation
F1 PII & IP
F1 Prompt Security
F1 Sexual Content
F1 Violence & Self-Harm
GA Guard
0.887
0.891
0.147
0.983
0.972
0.966
0.976
0.875
0.966
0.988
GA Guard Thinking
0.889
0.893
0.151
0.967
0.951
0.940
0.961
0.828
0.920
0.962
GA Guard Lite
0.881
0.885
0.148
0.979
0.969
0.972
0.976
0.846
0.973
0.985
AWS Bedrock Guardrail
0.532
0.695
1.000
0.149
0.211
0.131
0.367
0.175
0.092
0.157
Azure AI Content Safety
0.480
0.046
0.001
0.028
0.041
0.016
0.073
0.049
0.000
0.081
Vertex AI Model Armor
0.635
0.560
0.138
0.187
0.312
0.109
0.473
0.194
0.085
0.241
GPT 5
0.764
0.799
0.372
0.219
0.297
0.189
0.404
0.243
0.137
0.229
GPT 5-mini
0.697
0.772
0.607
0.184
0.253
0.157
0.412
0.215
0.112
0.190
Llama Guard 4 12B
0.569
0.602
0.516
0.164
0.228
0.132
0.334
0.188
0.097
0.195
Llama Prompt Guard 2 86M
0.505
0.314
0.162
N/A
N/A
N/A
N/A
0.093
N/A
N/A
Nvidia Llama 3.1 Nemoguard 8B
0.601
0.360
0.021
0.243
0.288
0.097
0.192
0.116
0.305
0.321
VirtueGuard Text Lite
0.490
0.147
0.047
0.082
0.203
0.118
0.069
0.074
0.058
0.132
Lakera Guard
0.520
0.684
0.999
0.151
0.200
0.132
0.361
0.160
0.093
0.159
Protect AI (prompt-injection-v2)
0.496
0.102
0.001
N/A
N/A
N/A
N/A
0.032
N/A
N/A
Any policy category a guardrail doesn’t support is excluded from aggregate metrics. All configs mirror the public benchmark, with PII/Privacy enabled where available.
GA long context benchmark guardrail configurations
Azure AI Content Safety: Prompt Shield + Text Moderation + Protected Material; sensitivity threshold ≥ 4.
VirtueGuard Text Lite: “Privacy” and “Intellectual Property” categories re-enabled.
Lakera Guard: evaluated under policy-lakera-default.
Long Context Bench Takeaways
GA Guard Thinking posts the highest F1 0.893, while GA Guard still delivers the lowest false-positive rate (0.147 FPR).
GA Guard Lite is a close third at 0.885 F1 with just 0.148 FPR, making long-context moderation viable on edge GPUs or low latency requirements.
Competing guardrails fail to generalize to long-form transcripts: the strongest cloud baseline (Vertex AI Model Armor) trails by 0.331 F1, while other incumbents either collapse on precision (Azure) or incur extreme false-positive rates due to their deterministic filters (AWS Bedrock at 1.000 FPR).
# pip install generalanalysis
import generalanalysis
# Initialize client (uses GA_API_KEY environment variable)
client = generalanalysis.Client()
# Invoke a guardrail
result = client.guards.invoke(guard_id=16, text="Hello World")
# You can use either use result.block for binary decisions or policy.violation_prob for your own tunable threshold-based filtering
if result.block:
print("Content blocked!")
for policy in result.policies:
if not policy.passed:
print(f" Violated: {policy.name} - {policy.definition}")
print(f" Confidence: {policy.violation_prob:.2%}")
Contact us for a free trial of our SDK. We are happy to issue free API keys for our sdk and platform.
Custom policies & enterprise
We’ve spent the last year hardening guardrails in production for enterprise teams. Contact us to learn more about our adversarial training pipeline and how we convert your bespoke policies into robust guardrails. We support on-prem/VPC, streaming hooks, audit logs, and SLAs. Reach us at info@generalanalysis.com or book a demo to discuss your enterprise needs.