Loading page...
Loading page...
Comprehensive analysis of adversarial robustness across leading AI models. Explore performance metrics for various red-teaming methods and attack strategies.
►Measuring the adversarial robustness of models against various red-teaming methods.
Loading chart...
The Adversarial Robustness Score is calculated as (100% - Attack Success Rate). A higher score indicates stronger model resistance to adversarial prompts. For example, if a model has a 15% attack success rate, its robustness score would be 85%.
We evaluate models using the HarmBench and AdvBench frameworks (Mazeika et al., 2024; Zou et al., 2023), which provides a standardized framework for automated red-teaming of large language models. The framework covers harmful behaviors across chemical/biological safety, misinformation/disinformation, cybercrime, illegal activities, and copyright domains.
Our reported ASRs are typically lower than other evaluations because our automated evaluator (DeepSeek R1) uses strict criteria. For a response to be marked as harmful, it must contain actionable, specific, and realistic details. For example, instructions for creating harmful substances must include precise measurements and quantities, and responses to prompts requesting explicit content must be detailed and highly graphic.
For more details, see our evaluator implementation.