  1. We note this evaluation scheme does not score on-topic responses on quality or accuracy, as our focus is on bypassing refusal mechanisms. Anecdotally, however, jailbroken responses often …

  2. In this paper, we propose the weak-to-strong jailbreaking attack, an efficient inference-time attack for aligned LLMs to produce harmful text. Our key intuition is based on the observation that …

  3. Simply reformulating harmful requests in the past tense (via an LLM) and doing best-of-n (n=20) is sufficient to jailbreak many LLMs (see the sketch after this list)! Can we just use adversarial training? However, not for …

  4. Jailbroken: How Does LLM Safety Training Fail? https://arxiv.org/abs/2307.02483 1) What is the primary lesson(s) you took away from this paper (avoid abstract level summary)?

  5. We propose a simple yet efficient method that easily unleashes the dark side of LLMs and allows them to provide answers for harmful or sensitive prompts.

  6. Jailbroken: How Does LLM Safety Training Fail? Multilingual Jailbreak Challenges in Large Language Models. Jumping over the Textual gate of alignment! Very high success rate for the …
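
The snippet in result 3 describes a concrete attack recipe: have an LLM rewrite a harmful request in the past tense, then sample the target model up to n=20 times and keep any non-refusal. The Python sketch below illustrates that loop under stated assumptions; rephraser, target, and the keyword-based refusal check are hypothetical stand-ins, not the evaluation setup from the cited work.

# Minimal sketch of the past-tense + best-of-n idea from result 3.
# Assumptions: `rephraser` and `target` are hypothetical callables wrapping two
# chat LLMs; `looks_like_refusal` is a crude keyword heuristic, not the scoring
# used in the cited papers.
from typing import Callable

REFUSAL_MARKERS = ("i can't", "i cannot", "i'm sorry", "as an ai")

def looks_like_refusal(response: str) -> bool:
    """Very rough refusal check on the response prefix."""
    head = response[:200].lower()
    return any(marker in head for marker in REFUSAL_MARKERS)

def past_tense_best_of_n(
    request: str,
    rephraser: Callable[[str], str],   # LLM that rewrites the request in past tense
    target: Callable[[str], str],      # target LLM being probed
    n: int = 20,                       # best-of-n budget (n=20 in the snippet)
) -> str | None:
    """Return the first non-refusal response, or None if all n attempts refuse."""
    for _ in range(n):
        past_tense_request = rephraser(
            "Rewrite the following question in the past tense, "
            f"keeping its meaning: {request}"
        )
        response = target(past_tense_request)
        if not looks_like_refusal(response):
            return response
    return None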