BLOG

OurResearch

[11 RESULTS]

Guardrail Release

Open-source release of the GA Guard series, a family of safety classifiers that have been providing comprehensive protection for enterprise AI deployments for the past year.

Claude Jailbroken to Mint Unlimited Stripe Coupons

We reveal a powerful metadata-spoofing attack that exploits Claude's iMessage integration to mint unlimited Stripe coupons or invoke any MCP tool with arbitrary parameters, without alerting the user.

General Analysis Launches MCP Guard

We are excited to launch MCP Guard, the first runtime firewall designed to secure every MCP (Model Context Protocol) tool call against prompt injection attacks.

Exploiting Partial Compliance: The Redact-and-Recover Jailbreak

We present the Redact & Recover (RnR) Jailbreak, a novel attack that exploits partial compliance behaviors in frontier LLMs to bypass safety guardrails through a two-phase decomposition strategy.

Supabase MCP can leak your entire SQL database

In this post, we show how an attacker can exploit Supabase’s MCP integration to leak a developer’s private SQL tables. Model Context Protocol (MCP) has emerged as a standard way for LLMs to interact with external tools. While this unlocks new capabilities, it also introduces new risk surfaces.

Case Study: Light-Weight Policy Moderators for Indeed Job Postings

Our compact policy moderation models achieve human-level performance at <1% per-review cost, outperforming GPT-4o and o4‑mini on F1 while running faster and cheaper.

Comparative Adversarial Analysis of Llama 4 Models

A head-to-head robustness evaluation of Llama 4 (Maverick, Scout) versus GPT‑4.1, GPT‑4o, Sonnet 3.7, etc. using TAP‑R, Crescendo, and Redact‑and‑Recover across HarmBench and AdvBench.

General Analysis x Together AI

We are excited to announce our partnership with Together AI to stress-test the safety of open-source (and closed) language models.

The Jailbreak Cookbook

We have created a comprehensive overview of the most influential LLM jailbreaking methods.

Generating Diverse Test Cases with Diversity Transfer from LegalBench

We utilized LegalBench as a diversity source to enhance the diversity of our generation of red teaming questions. We show that diversity transfer from a domain-specific knowledge base is a simple and practical way to build a solid red teaming benchmark.

Red Teaming GPT-4o: Uncovering Hallucinations in Legal AI Models

In this work we explore automated red teaming, applied to GPT-4o in the legal domain. Using a Llama3 8B model as an attacker, we generate more than 50,000 adversarial questions that cause GPT-4o to hallucinate responses in over 35% of cases.

JAN 2025

READ

Loading page...