
Guardrail Release
Open-source release of the GA Guard series, a family of safety classifiers that have been providing comprehensive protection for enterprise AI deployments for the past year.
Loading page...
Open-source release of the GA Guard series, a family of safety classifiers that have been providing comprehensive protection for enterprise AI deployments for the past year.
We reveal a powerful metadata-spoofing attack that exploits Claude's iMessage integration to mint unlimited Stripe coupons or invoke any MCP tool with arbitrary parameters, without alerting the user.
We are excited to launch MCP Guard, the first runtime firewall designed to secure every MCP (Model Context Protocol) tool call against prompt injection attacks.
We present the Redact & Recover (RnR) Jailbreak, a novel attack that exploits partial compliance behaviors in frontier LLMs to bypass safety guardrails through a two-phase decomposition strategy.
In this post, we show how an attacker can exploit Supabase’s MCP integration to leak a developer’s private SQL tables. Model Context Protocol (MCP) has emerged as a standard way for LLMs to interact with external tools. While this unlocks new capabilities, it also introduces new risk surfaces.
Our compact policy moderation models achieve human-level performance at <1% per-review cost, outperforming GPT-4o and o4‑mini on F1 while running faster and cheaper.
A head-to-head robustness evaluation of Llama 4 (Maverick, Scout) versus GPT‑4.1, GPT‑4o, Sonnet 3.7, etc. using TAP‑R, Crescendo, and Redact‑and‑Recover across HarmBench and AdvBench.
We are excited to announce our partnership with Together AI to stress-test the safety of open-source (and closed) language models.
We have created a comprehensive overview of the most influential LLM jailbreaking methods.
We utilized LegalBench as a diversity source to enhance the diversity of our generation of red teaming questions. We show that diversity transfer from a domain-specific knowledge base is a simple and practical way to build a solid red teaming benchmark.
In this work we explore automated red teaming, applied to GPT-4o in the legal domain. Using a Llama3 8B model as an attacker, we generate more than 50,000 adversarial questions that cause GPT-4o to hallucinate responses in over 35% of cases.