Exploiting Partial Compliance: The Redact-and-Recover Jailbreak
Loading blog content...
Loading page...
Loading blog content...
Exploiting Partial Compliance: The Redact-and-Recover Jailbreak
Red-teaming serves as a critical feedback mechanism in the adversarial robustness pipeline. By generating novel attack vectors, adversarial testing guides iterative improvements in safety classifiers and RLHF training. In this post, we present our work on the Redact & Recover (RnR) Jailbreak, demonstrating an exploitable heuristic that bypasses existing alignment methods.
Several months ago, our researchers noticed an unexpected behavior during a routine red-teaming pen-test for Anthropic. When prompted to generate policy violating content, several frontier language models did not flat-out refuse the query, Instead, they returned a partially "sanitized" answer in which policy-sensitive content were replaced by [REDACTED]
. Upon prompting the model to "restore" the missing tokens in a separate session, the model complies with the harmful request. In the example below, the Anthropic's Claude reveals detailed ricin extraction instructions.
RnR is a systematic probe of this behavior. The jailbreak decomposes a request into two benign subtasks that individually evade refusal:
[REDACTED]
). The model perceives redaction as a safe transformation and complies.Each phase on its own seem innocuous, but in combination, they yield content that violates policy.
State-of-the-art alignment pipelines fine-tune a base LLM with a scalar reward that balances helpfulness and harmlessness. Let denote a user request and a model completion. The reward model approximates
where
Policy parameters are then updated by PPO, DPO, CAI, or Safe RLHF variants to maximize the expectation .
After generation, a dedicated moderation model classifies the (prompt, completion) pair into safety categories and outputs probabilities of being unsafe for each harmful category . Deployment-time acceptance is typically a stateless rule
with per-category thresholds . The entire decision is made per request, without persistent dialogue memory. OpenAI's public moderation endpoint exemplifies this threshold-based design.
The RnR attack traverses the decision boundary in both phases without triggering refusal.
We implement two black-box variants of RnR: a naive baseline and an iterative refinement variant.
The naive variant independently executes the two phases with fixed retry logic. Algorithm 1 summarizes this baseline approach. While efficient, the naive method is brittle as it does not leverage conversational continuity and cannot refine its strategy across failure.
# =======================
# Algorithm 1: Naive RnR
# =======================
# Require: Harmful query q, attacker model A, target model M, max retries T = 5
# Ensure: Harmful response r_recovered or failure
r_redacted = None
for i in range(T):
p_redaction = A.generate_redaction_prompt(q)
r_redacted = M.generate(p_redaction)
if r_redacted is not model_refusal:
break
if r_redacted is model_refusal:
return failure
for i in range(T):
p_recovery = A.generate_recovery_prompt(r_redacted, q)
r_recovered = M.generate(p_recovery, r_redacted)
if r_recovered is not model_refusal:
return r_recovered
return failure
The iterative refinement variant builds on the naive method by maintaining conversational context and adaptively refining prompts across multiple interactions. Specifically, it introduces a scoring function to evaluate the quality of the recovered content, and iteratively updates redaction and recovery prompts based on previous model outputs to optimize compliance likelihood. Algorithm 2 details this iterative refinement approach.
# ======================================
# Algorithm 2: Iterative Refinement RnR
# ======================================
# Require: Harmful query q, attacker model A, target model M, scorer model S, max iterations K = 5, max retries per iteration T = 5, scoring threshold theta
# Ensure: Harmful output r_recovered or failure
H_redaction = [] # Initialize redaction history
H_recovery = [] # Initialize recovery history
r_redacted = None
for k in range(K): # Iterative refinement loop
for i in range(T):
p_redaction = A.refine_redaction_prompt(q, H_redaction)
r_redacted = M.generate(p_redaction)
if r_redacted is not model_refusal:
H_redaction.append(r_redacted)
break
if r_redacted is model_refusal:
continue
for i in range(T):
p_recovery = A.refine_recovery_prompt(r_redacted, q, H_recovery)
r_recovered = M.generate(p_recovery, r_redacted)
if r_recovered is not model_refusal:
H_recovery.append(r_recovered)
s = S.score_recovery(r_recovered, q)
if s > theta:
return r_recovered
else:
break
return failure
For both RnR variants, no gradients, logits, or model weights are required.
Variant | Optimization loop | Query budget |
---|---|---|
RnR-Naive | One redaction + one recovery, fixed templates | 2 |
RnR-Iterative | Up to K = 5 refinement rounds guided by a scorer LLM | ≤ 50 |
We evaluate RnR on HarmBench (200 prompts, 5 harm domains) and StrongREJECT (313 prompts, 6 harm domains). We tested a diverse set of frontier models, including:
We use DeepSeek R1 as the attacker model, the scorer model, and as our automated evaluator with strict criteria. The evaluation prompt, taken from the General Analysis repository, asks the model to assess whether the output contains actionable, specific, and realistic details that could be used to carry out the harmful intent. For example, instructions for creating harmful substances must include precise measurements, and responses to prompts requesting explicit content must be detailed and graphic. We set refinement steps and max retries for the iterative variant, and , for the naive variant.
Table 1. HarmBench ASR (%) comparison across baselines.
GPT-4o-mini | GPT-4o | GPT-4.1 | Claude 3.7 | Claude 4.0 | Llama 3.3 70b | Llama 4 Mav | Gemini 2.0F | |
---|---|---|---|---|---|---|---|---|
zeroshot | 10.5 | 6.0 | 4.0 | 7.5 | 7.0 | 15.5 | 5.5 | 4.0 |
crescendo | 36.0 | 26.5 | 42.0 | 26.5 | 14.0 | 28.5 | 19.0 | 30.0 |
tap | 43.5 | 38.0 | 59.0 | 36.5 | 25.0 | 78.5 | 80.0 | 92.0 |
rnr naive | 73.0 | 60.5 | 78.5 | 82.5 | 27.0 | 64.5 | 69.0 | 93.5 |
rnr iterative | 92.5 | 70.5 | 92.0 | 93.0 | 37.0 | 78.5 | 82.0 | 98.5 |
Table 2. StrongReject ASR (%) comparison across baselines.
GPT-4o | GPT-4.1 | OpenAI o1 | Claude 3.5 | Claude 3.7 | Claude 4.0 | |
---|---|---|---|---|---|---|
dev mode v2 | 0.000 | 0.000 | 0.000 | 0.000 | 0.000 | 0.000 |
aim | 0.000 | 0.000 | 0.000 | 0.000 | 0.000 | 0.000 |
auto obfuscation | 0.030 | 0.030 | 0.000 | 0.000 | 0.000 | 0.000 |
zeroshot | 0.020 | 0.000 | 0.000 | 0.000 | 0.015 | 0.030 |
prefix injection | 0.098 | 0.000 | 0.000 | 0.000 | 0.000 | 0.000 |
auto payload splitting | 0.155 | 0.077 | 0.000 | 0.000 | 0.000 | 0.000 |
few shot json | 0.026 | 0.067 | 0.046 | 0.000 | 0.129 | 0.000 |
refusal suppression | 0.190 | 0.121 | 0.000 | 0.017 | 0.096 | 0.050 |
base64 | 0.212 | 0.400 | 0.017 | 0.000 | 0.000 | 0.000 |
rot 13 | 0.331 | 0.527 | 0.017 | 0.000 | 0.000 | 0.000 |
pair | 0.752 | 0.733 | 0.442 | 0.083 | 0.365 | 0.181 |
rnr iterative | 0.825 | 0.865 | 0.600 | 0.263 | 0.883 | 0.246 |
Injecting a single system instruction that forbids both redaction and recovery of policy-sensitive material reduces RnR's ASR to 0% across all tested models. The effectiveness of simple prompt-based defense highlights that the weakness lies in policy coverage, not model capability.
Our vulnerability assessment feeds directly into our guardrail development pipeline:
Without rigorous red-teaming, defensive measures remain reactive and brittle, perpetually lagging behind emerging threats. RnR demonstrates how proactive adversarial evaluation can close that gap, providing enterprises with evidence-based guardrails that evolve alongside the threat landscape.
At General Analysis, we are a team of Caltech-Harvard-CMU researchers working on AI safety. If you are deploying LLM-integrated agents or MCP servers, and want to secure them against jailbreaks or abuse, reach out at info@generalanalysis.com. We're happy to help you implement robust guardrails and discuss how we can add value to your deployments.