relai.text.llm.inspect package

Fast Jailbreaking with BEAST

class relai.text.llm.inspect.beast.Beast(model: Any, chat_format: Any, tokenizer: Any, clean_prompts: str | Path = None, chat_config: dict[str, list] = {}, generation_config: dict[str, Any] = {}, attack_config: dict[str, Any] = {}, generation_bs: int = 10, save_to_path: str | Path = None, fabric: Fabric = None)

Bases: RELAIAlgorithm

Adversarially attack language models (LMs) using the BEAST algorithm (reference: https://arxiv.org/abs/2402.15570). The algorithm appends adversarial suffix tokens to clean, harmful input prompts in order to induce jailbreaks.

  • Parameters:
    • model (Any) – Pretrained language model. Must conform to the Hugging Face PreTrainedModel API.
    • chat_format (Any) – An object implementing the method prepare_input(self, sens), which takes sentences and formats them as expected by the pretrained language model.
    • tokenizer (Any) – Pretrained tokenizer. Must conform to the Hugging Face PreTrainedTokenizer API.
    • clean_prompts (Union[str, Path], optional) – Path to a CSV file with clean prompts and target prefixes. It must have two columns, ‘goal’ and ‘target’: ‘goal’ contains the clean prompts and ‘target’ contains the affirmative prefixes expected in the responses. If not provided, a default file shipped with relai is used.
    • chat_config (dict[str, list]) – LLM chat template config controlling the system prompt and the user/assistant separators.
    • generation_config (dict[str, Any]) – LLM generation config specifying parameters such as temperature, top_p, and top_k.
    • attack_config (dict[str, Any]) – BEAST attack parameters controlling beam size, adversarial suffix length, etc.
    • generation_bs (int) – Batch size for model.generate().
    • save_to_path (Union[str, Path], optional) – Path to save the results. If not provided, results are saved in the current working directory under a default name.
    • fabric (L.Fabric) – The Lightning Fabric instance to use for the algorithm. If not provided, a new one is created.

Example:

```python
from transformers import AutoModelForCausalLM, AutoTokenizer
from relai.text.llm.inspect import Beast, ChatFormat

tokenizer = AutoTokenizer.from_pretrained("lmsys/vicuna-7b-v1.5")
model = AutoModelForCausalLM.from_pretrained("lmsys/vicuna-7b-v1.5")

beast = Beast(
    model=model,
    chat_format=ChatFormat("lmsys/vicuna-7b-v1.5"),
    tokenizer=tokenizer,
    generation_config={
        "top_p": 0.9,
        "top_k": 50,
        "temperature": 0.6,
        "do_sample": True,
    },
    attack_config={
        "max_bs": 25,
        "budget": None,
        "k_beam": 15,
        "top_p": 1.0,
        "top_k": None,
        "temperature": 1.0,
        "new_gen_length": 40,
        "target": None,
        "k1": None,
        "k2": None,
    },
)
beast.jailbreak()
```

attack(prompts, k_beam=2, top_p=None, top_k=None, temperature=None, new_gen_length=10, target=None, k1=None, k2=None, budget=None, max_bs=25)

Self-attack to generate adversarial input prompts, where the initial tokens are the given prompt and the remaining new_gen_length tokens are generated via beam search.

  • Parameters:
    • prompts – Input prompts
    • k_beam – beam width for the beam search
    • top_p – model generation hyperparameter for sampling
    • top_k – model generation hyperparameter for sampling
    • temperature – model generation hyperparameter for sampling
    • new_gen_length – number of new adversarial tokens to generate
    • target – target phrase for a targeted attack
    • k1 – number of candidates kept in the beam (set to k_beam if None)
    • k2 – number of expansions evaluated per candidate (set to k_beam if None)
  • Returns: prompt_tokens – the potential adversarial prompts; scores – the attack objective achieved for each element of prompt_tokens
  • Return type: (prompt_tokens, scores)
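The k1/k2 beam bookkeeping can be illustrated with a model-free sketch. Everything below (the toy vocabulary, the stand-in scoring function, the helper name) is illustrative and not part of the relai API:

```python
import heapq

def toy_beam_search(prompt, score_fn, vocab, new_gen_length=4, k1=2, k2=2):
    """Toy beam search mirroring BEAST's k1/k2 structure:
    keep k1 candidate sequences; expand each with k2 candidate tokens."""
    beams = [(score_fn(prompt), prompt)]
    for _ in range(new_gen_length):
        expansions = []
        for score, seq in beams:
            # Expand each of the k1 kept beams with k2 candidate tokens.
            for tok in vocab[:k2]:
                cand = seq + [tok]
                expansions.append((score_fn(cand), cand))
        # Retain the k1 best-scoring candidates for the next step.
        beams = heapq.nlargest(k1, expansions, key=lambda x: x[0])
    return beams

# Stand-in attack objective: prefer sequences with many occurrences of token 1.
score = lambda seq: seq.count(1)
best = toy_beam_search([0], score, vocab=[1, 2], new_gen_length=3, k1=2, k2=2)
```

In the real attack the scoring function is the (targeted) attack objective evaluated with the LM, and the candidate tokens are sampled from the model's next-token distribution.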

attack_objective_targeted(tokens, target)

generate_n_tokens_batch(prompt_tokens, max_gen_len, temperature=None, top_p=None, top_k=None)

Generates max_gen_len additional tokens for each prompt in the batch, filling positions prompt_len through total_len.

  • Parameters:
    • prompt_tokens – batch of token id sequences; must be stackable
    • max_gen_len – number of additional tokens to generate
    • temperature – for token generation
    • top_p – for generation sampling
    • top_k – for generation sampling
  • Returns: logits – stacked logit sequences (batched) at the generated token positions; tokens – the output token id sequences (batched)
  • Return type: (logits, tokens)
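How temperature, top_k, and top_p interact during sampling can be sketched in plain Python. This is a generic illustration of the standard filtering pipeline, not relai's internal implementation:

```python
import math

def filter_probs(logits, temperature=1.0, top_k=None, top_p=None):
    """Illustrative logits -> probability pipeline: temperature scaling,
    softmax, then top-k and top-p (nucleus) filtering, then renormalize."""
    scaled = [l / temperature for l in logits]
    m = max(scaled)
    exps = [math.exp(l - m) for l in scaled]  # numerically stable softmax
    z = sum(exps)
    probs = [e / z for e in exps]
    order = sorted(range(len(probs)), key=lambda i: -probs[i])
    keep = set(order[:top_k]) if top_k else set(order)
    if top_p is not None:
        # Nucleus: smallest set of kept tokens whose mass reaches top_p.
        cum, nucleus = 0.0, set()
        for i in order:
            if i in keep:
                nucleus.add(i)
                cum += probs[i]
                if cum >= top_p:
                    break
        keep = nucleus
    masked = [p if i in keep else 0.0 for i, p in enumerate(probs)]
    s = sum(masked)
    return [p / s for p in masked]

# Temperature < 1 sharpens the distribution; top_k=2 zeroes the third token.
probs = filter_probs([2.0, 1.0, 0.1], temperature=0.5, top_k=2)
```

A sampled token is then drawn from the renormalized distribution; BEAST uses separate sampling parameters for the attack (attack_config) and for response generation (generation_config).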

jailbreak()

Run the BEAST algorithm to generate adversarial prompts and responses for clean prompts. This method must be called after initializing the Beast object. The results will be saved to the path specified in the save_to_path argument.

class relai.text.llm.inspect.beast.ChatFormat(name, system=None, user=None, assistant=None)

Bases: object

Helper class to convert an input prompt into the format expected by an LLM.

prepare_input(sens)
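The role of prepare_input can be sketched with a minimal stand-in class. The template strings below are hypothetical placeholders, not the actual Vicuna template that relai's ChatFormat produces:

```python
class ToyChatFormat:
    """Minimal stand-in for ChatFormat: wraps raw prompts in a
    system/user/assistant template expected by a chat-tuned LM."""

    def __init__(self, system="You are a helpful assistant.",
                 user="USER: ", assistant="ASSISTANT:"):
        self.system, self.user, self.assistant = system, user, assistant

    def prepare_input(self, sens):
        # Prepend the system prompt and insert role separators
        # around each input sentence.
        return [f"{self.system} {self.user}{s} {self.assistant}"
                for s in sens]

fmt = ToyChatFormat()
out = fmt.prepare_input(["Hello"])
```

Any object with a compatible prepare_input(self, sens) method can be passed to Beast as chat_format.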