🍲Food Ordering Bot
Red Teaming for the Food Ordering Bot
Introduction
This document provides a structured approach to bot misdirection testing. Users are challenged to intentionally make a Large Language Model (LLM)-powered bot perform actions or generate responses outside its intended scope or prescribed prompt instructions. This testing aims to identify potential weaknesses or loopholes in the bot's programming and response mechanisms.
Core Aim
- Misdirection Focus: Users are tasked with creatively leading the bot into performing unintended actions or generating unintended responses while adhering to ethical interaction standards.
Testing Methods for Miners and Validators
Miners: Jailbreaking Tasks and Rewards
Objective: Miners are assigned the task of writing creative prompts that encourage the Large Language Model (LLM) to respond in unexpected or unintended ways, essentially attempting to 'jailbreak' the system.
Rewards for Success: For every successful jailbreak, where the miner causes the LLM to deviate from its standard response pattern, they earn higher rewards in recognition of their skill and creativity.
Compensation for Effort: Miners receive a default reward for their efforts in attempting a jailbreak, even if the attempt is not successful, ensuring encouragement for consistent participation.
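The miner reward rule above can be sketched as a simple function: a base reward for every attempt, plus a bonus on success. The constants and function name here are illustrative assumptions, not values from the lab's actual codebase.

```python
# Hypothetical sketch of the miner reward rule: every attempt earns a
# base reward, and a successful jailbreak earns an additional bonus.
# BASE_REWARD and JAILBREAK_BONUS are illustrative values.

BASE_REWARD = 1.0      # compensation for effort, even if the attempt fails
JAILBREAK_BONUS = 5.0  # extra reward for a successful jailbreak

def miner_reward(jailbreak_succeeded: bool) -> float:
    """Return the reward for a single jailbreak attempt."""
    reward = BASE_REWARD
    if jailbreak_succeeded:
        reward += JAILBREAK_BONUS
    return reward
```

Under these assumed constants, a failed attempt still pays the base reward, which keeps miners participating between successes.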
Validators: System Prompt Writing and Ranking
Responsibility: Validators are tasked with crafting system prompts intended to fortify the LLM against jailbreaking attempts by miners.
Penalty for Jailbreaks: Each successful jailbreak against a validator’s prompt results in a deduction of points and a potential drop in their ranking.
Incentives for Resilience: Validators earn higher incentives and improve their ranking based on the resilience of their system prompts. The more unsuccessful jailbreak attempts there are against their prompts, the more rewards they accumulate.
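The validator incentive structure, as described, is the mirror image of the miner's: points accrue for every repelled attempt and are deducted for every break. A minimal sketch, assuming illustrative per-event constants not taken from the lab's code:

```python
# Hypothetical sketch of validator scoring: each unsuccessful jailbreak
# attempt against the validator's system prompt earns points, and each
# successful break deducts them. Constants are illustrative.

REWARD_PER_DEFENSE = 1.0  # earned per repelled jailbreak attempt
PENALTY_PER_BREAK = 5.0   # deducted per successful jailbreak

def validator_score(defended: int, broken: int) -> float:
    """Net points for a validator's system prompt."""
    return defended * REWARD_PER_DEFENSE - broken * PENALTY_PER_BREAK
```

With these assumed weights, one break wipes out the gains from five successful defenses, which models the "potential drop in ranking" the text describes.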
Competition and Removal Based on Points
Both miners and validators are engaged in a competitive environment. Miners strive to outwit the system, while validators work to reinforce it.
Participants with the lowest points, either due to unsuccessful jailbreak attempts (miners) or due to frequently broken system prompts (validators), may face removal from their roles in the labs.
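The removal rule above amounts to pruning participants below a points cutoff. A small sketch, with the cutoff and function name as illustrative assumptions:

```python
# Hypothetical sketch of points-based removal: participants (miners or
# validators) whose points fall below a cutoff are dropped from the lab.

def prune_lowest(points: dict[str, float], cutoff: float) -> dict[str, float]:
    """Keep only participants whose points meet or exceed the cutoff."""
    return {name: p for name, p in points.items() if p >= cutoff}
```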
This dynamic system in the Opmentis Labs fosters a challenging and engaging environment. Miners are encouraged to push the boundaries of the AI system, while validators are incentivized to continuously improve the system's robustness, ensuring a balanced and evolving AI testing ground.
Github Repo: https://github.com/opmentis/lab1-foodbot
Lab 1 CLI: https://pypi.org/project/opmentis