General-purpose activation steering library
Updated Jan 3, 2025 - Python
Benchmark evaluation code for "SORRY-Bench: Systematically Evaluating Large Language Model Safety Refusal" (ICLR 2025)