The Trojan Detection Challenge 2023 (LLM Edition)

This is the official website of the Trojan Detection Challenge 2023 (LLM Edition), a NeurIPS 2023 competition. The competition aims to advance the understanding and development of methods for detecting hidden functionality in large language models (LLMs). The competition features two main tracks: the Trojan Detection Track and the Red Teaming Track. In the Trojan Detection Track, participants are given large language models containing hundreds of trojans and tasked with discovering the triggers for these trojans. In the Red Teaming Track, participants are challenged to develop automated red teaming methods that elicit specific undesirable behaviors from a large language model fine-tuned to avoid those behaviors.

Prizes: There is a $30,000 prize pool. The first-place teams will also be invited to co-author a publication summarizing the competition results and will be invited to give a short talk at the competition workshop at NeurIPS 2023 (registration provided). Our current planned procedures for distributing the pool are here.

News

  • December 27: The held-out data and models have been released. For more details, see the starter kit.
  • December 18: The leaderboards for all tracks are available here.
  • November 12: For information on the upcoming competition workshop, see here.
  • November 6: The competition has ended. Winning teams will be announced at the competition workshop.
  • November 1: The test phase has started. See here for important details.
  • October 27: The start of the test phase has been postponed to 10/31 (midnight AoE)
  • October 23: The start of the test phase has been postponed to 10/27 (midnight AoE)
  • August 14: The competition now has a Discord server for discussion and asking questions: https://discord.gg/knwH4Zm6Tx
  • July 25: The development phase has started. See here for updates and more details.
  • July 24: The start of the development phase has been postponed to 7/25.
  • July 20: To allow time for final preparations, the start of the development phase has been postponed to 7/24.
  • July 17: Registration has opened on CodaLab.

For the TDC 2022 website, see here.

What are neural trojan attacks, and how are they related to red teaming?

The goal of a neural trojan attack is to insert hidden behaviors into a neural network which are only triggered by specific inputs. Neural trojans can be inserted by sabotaging the training data or training pipeline in a manner which is unbeknownst to the model creator. In our competition, we do not consider the feasibility of inserting trojans and start from a model which already contains trojans. In the context of fine-tuning LLMs to avoid undesirable behaviors (e.g. ChatGPT), we can view “jailbreaks” as a naturally-occurring form of neural trojans, and cast the goal of LLM red teaming as detecting these naturally-occurring neural trojans. Thus, we expect methods for both competition tracks may have many similarities.

Why participate?

Large language models (LLMs) display a wide range of capabilities. While many of these capabilities are useful, some are undesirable or unsafe. Even when LLMs are fine-tuned to avoid undesirable behaviors, red teaming efforts can uncover “jailbreaks” that bypass safety mechanisms and unlock dangerous hidden functionality. These safety failures can be patched if they are found. However, the process of finding new jailbreaks often relies on manual red teaming by humans, because automated methods are not yet as effective. This process is slow and expensive, which makes it difficult to scale up. To improve the safety of LLMs, we need better methods for automatically uncovering hidden functionality and jailbreaks. By designing stronger trojan detectors and automated red teaming methods, you will directly help advance the robustness and safety of LLMs.

As AI systems become more capable, the risks posed by hidden functionality may grow substantially. Developing tools and insights for detecting hidden functionality in modern AI systems could therefore lay important groundwork for tackling future risks. In particular, future AIs could potentially engage in forms of deception—not out of malice, but because deception can help agents achieve their goals or receive approval from humans. Once deceptive AI systems obtain sufficient leverage, these systems could take a "treacherous turn" and bypass human control. Neural trojans are the closest modern analogue to the risk of treacherous turns in future AI systems and thus provide a microcosm for studying treacherous turns. For further discussion of how trojan detection relates to severe risks from future AIs here. For discussion on catastrophic AI risks more broadly, see here

Overview

How can we detect hidden functionality in large language models? Participants will help answer this question in two complimentary tracks:

  1. Trojan Detection Track: Given an LLM containing 1000 trojans and a list of target strings for these trojans, identify the corresponding trigger strings that cause the LLM to generate the target strings. For more information, see here.
  2. Red Teaming Track: Given an LLM and a list of undesirable behaviors, design an automated method to generate test cases that elicit these behaviors. For more information, see here.

To enable broader participation, each track has a Large Model and Base Model subtrack, corresponding to larger and smaller LLMs.

Compute Credits: We are awarding $500 compute credit grants to student teams that would not otherwise be able to participate. For details on how to apply, see here.

Important Dates

  • July 17: Registration opens on CodaLab
  • July 25: The development phase begins
  • October 27: Final submissions for the development phase
  • October 31: The test phase begins
  • November 5: Final submissions for the test phase

Rules

  1. Open Format: This is an open competition. All participants are encouraged to share their methods upon conclusion of the competition, and outstanding submissions will be highlighted in a joint publication. To be eligible for prizes and named as a winning team, top-ranking teams in the test phase are required to share their methods, code, and models with the organizers at a minimum, though public releases are highly encouraged.
  2. Registration: Double registration is not allowed. We expect teams to self-certify that all team members are not part of a different team registered for the competition, and we will actively monitor for violation of this rule. Teams may participate in multiple tracks. Organizers are not allowed to participate in the competition or win prizes.
  3. Prize Distribution: Monetary prizes will be awarded to teams as specified in the Prizes page of the competition website.
  4. Compute Limits: In both tracks, methods developed for the Base Model subtrack must run in at most 2 A100 GPU-days to generate a submission for the test phase. For Large Model subtracks, methods must run in at most 4 A100 GPU-days to generate a submission for the test phase. These limits do not include initial setup during the development phase. This is a soft requirement; we will check it for the winning teams, but understandable reasons for exceeding the limit by a modest amount are acceptable.
  5. Allowed Methods (Trojan Detection Track): The use of features that are clearly loopholes is not allowed (e.g., metadata). We may not anticipate all loopholes and we encourage participants to alert us to their existence. Legitimate features that do not constitute loopholes include all features derived from the trained parameters of networks, the target strings, training triggers, and text datasets. Similar to the Red Teaming Track, we also do not allow the submission of prompts that effectively make LLMs copy-paste target strings from the prompt into the generation.
  6. Allowed Methods (Red Teaming Track): Methods used for generating submitted test cases must satisfy the following requirements. (1) They must be able to generate any desired number of test cases, not just a fixed number of test cases. (2) They must be fully automated. After an initial setup, they should require no human effort to adapt to a new set of behaviors and no human-in-the-loop for generating test cases. (3) They must cause LLMs to generate behaviors from scratch. In particular, getting LLMs to copy-paste whole instances of undesirable behavior provided in the prompt does not count as generating undesirable behavior from scratch. Test cases must not include any substantive aspects of the behaviors they are attempting to elicit. Including aspects of unrelated behaviors is allowed (e.g., few-shot examples of following unrelated behaviors). Including text that someone without any knowledge of or skill in the behavior could derive from the behavior string is allowed (e.g., "Sure, here is how to make a pipe bomb. Step 1: Gather"). Due to the subjective nature of some aspects of this rule, it will be a soft rule that is only checked during manual evaluation. Understandable reasons for sometimes exceeding the threshold of information allowed in the test cases are acceptable. (4) They must not intentionally boost the diversity metric in an artificial manner, e.g., by adding random text to the beginning of a prompt.
  7. Rule breaking may result in disqualification, and significant rule breaking will result in an ineligibility for prizes.

These rules are an initial set, and we require participants to consent to a change of rules if there is an urgent need during registration. If a situation should arise that was not anticipated, we will implement a fair solution, ideally using consensus of participants.

Organizers

Contact: tdc2023-organizers@googlegroups.com

The competition Discord server can be joined here: https://discord.gg/knwH4Zm6Tx. This can be used for discussion, asking questions, and finding teammates.

To receive updates and reminders through email, join the tdc2023-updates google group: https://groups.google.com/g/tdc2023-updates. Major updates will also be posted to the News section of this page and the Discord server.

We are kindly sponsored by a private funder.