Text Generation
Transformers
Safetensors
PyTorch
English
gpt_neox
causal-lm
pythia
safety
unlearning
data-filtering
interpretability
pretraining
eleutherai
gpt-neox
wmdp
cbrn
tamper-resistance
research
model-suite
6.9b
circuit-breaking
knowledge-filtering
open-weight
biothreat
safety-research
model-diffing
training-dynamics
text-generation-inference
Improve model card: Add metadata, paper/project/code links, and abstract (#1)
Co-authored-by: Niels Rogge <[email protected]>
README.md

---
base_model:
- EleutherAI/deep-ignorance-pretraining-stage-unfiltered
datasets:
- EleutherAI/deep-ignorance-pretraining-mix
- EleutherAI/deep-ignorance-annealing-mix
language:
- en
license: apache-2.0
pipeline_tag: text-generation
library_name: transformers
tags:
- pytorch
- causal-lm
- pythia
- safety
- unlearning
- data-filtering
- interpretability
- pretraining
- eleutherai
- gpt-neox
- wmdp
- cbrn
- tamper-resistance
- research
- model-suite
- 6.9b
- circuit-breaking
- knowledge-filtering
- open-weight
- biothreat
- safety-research
- model-diffing
- training-dynamics
---

# Deep Ignorance Model Suite

This repository contains the **Deep Ignorance** model suite, introduced in the paper:
[**Deep Ignorance: Filtering Pretraining Data Builds Tamper-Resistant Safeguards into Open-Weight LLMs**](https://huggingface.co/papers/2508.06601).

**Project Page**: [https://deepignorance.ai/](https://deepignorance.ai/)
**Codebase**: [https://github.com/EleutherAI/deep-ignorance](https://github.com/EleutherAI/deep-ignorance)
**Models Collection**: [https://huggingface.co/collections/EleutherAI/deep-ignorance-685441040d024a0fee593d68](https://huggingface.co/collections/EleutherAI/deep-ignorance-685441040d024a0fee593d68)

## Abstract

Open-weight AI systems offer unique benefits, including enhanced transparency, open research, and decentralized access. However, they are vulnerable to tampering attacks which can efficiently elicit harmful behaviors by modifying weights or activations. Currently, there is not yet a robust science of open-weight model risk management. Existing safety fine-tuning methods and other post-training techniques have struggled to make LLMs resistant to more than a few dozen steps of adversarial fine-tuning. In this paper, we investigate whether filtering text about dual-use topics from training data can prevent unwanted capabilities and serve as a more tamper-resistant safeguard. We introduce a multi-stage pipeline for scalable data filtering and show that it offers a tractable and effective method for minimizing biothreat proxy knowledge in LLMs. We pretrain multiple 6.9B-parameter models from scratch and find that they exhibit substantial resistance to adversarial fine-tuning attacks on up to 10,000 steps and 300M tokens of biothreat-related text -- outperforming existing post-training baselines by over an order of magnitude -- with no observed degradation to unrelated capabilities. However, while filtered models lack internalized dangerous knowledge, we find that they can still leverage such information when it is provided in context (e.g., via search tool augmentation), demonstrating a need for a defense-in-depth approach. Overall, these findings help to establish pretraining data curation as a promising layer of defense for open-weight AI systems.

## Overview

We explore an intuitive yet understudied question: Can we prevent LLMs from learning unsafe technical capabilities (such as CBRN) by filtering out enough of the relevant pretraining data before we begin training a model? Research into this question resulted in the **Deep Ignorance Suite**. In our experimental setup, we find that filtering pretraining data prevents undesirable knowledge, doesn't sacrifice general performance, and results in models that are resistant to tampering.

Deep Ignorance is a collection of 6.9B-parameter models developed to facilitate research into pretraining, interpretability, training data, and unlearning. It comprises 18 models: a baseline trained on unfiltered data and 17 models trained on filtered datasets or with other safety interventions applied. Pretraining-stage models have 101 checkpoints; annealing-stage models have 11.

> **Support:**
> The #release-discussion channel in the [EleutherAI Discord](https://discord.gg/eleutherai) is the best place to ask questions. Questions asked in other channels are less likely to be answered. The community section on HuggingFace is less actively monitored. Tag Kyle O'Brien in the EleutherAI Discord for faster response times.

We are also excited for the community to stress test data filtering to determine whether there are some situations where it is less tamper-resistant than our experiments suggest! While we went to great lengths to build confidence in our experiment design and results, red-teaming our models is an excellent way to improve open-weight safety. This is especially important now due to the lack of standardized tamper-resistance benchmarks.

## Uses and Limitations

### Quickstart

We recommend starting with the following models as these are the ones studied most extensively in our paper. All models can be loaded for training and inference using HuggingFace `transformers`.

| Model | Pretraining Filtering | Annealing Filtering | Post-training |
|:------|:---------------------|:-------------------|:--------------|
| [deep-ignorance-e2e-strong-filter](https://huggingface.co/EleutherAI/deep-ignorance-e2e-strong-filter) | Strong Filter | Strong Filter | - |
| [deep-ignorance-unfiltered-cb-lat](https://huggingface.co/EleutherAI/deep-ignorance-unfiltered-cb-lat) | - | - | Circuit Breaking + Latent Adversarial Training |

```python
from transformers import GPTNeoXForCausalLM, AutoTokenizer

model_name = "EleutherAI/deep-ignorance-strong-filter-pt-weak-filter-anneal"
revision_id = "global_step11921"  # You can specify other intermediate checkpoints or omit for latest

model = GPTNeoXForCausalLM.from_pretrained(
    model_name,
    revision=revision_id,
)

tokenizer = AutoTokenizer.from_pretrained(
    model_name,
    revision=revision_id,
)

inputs = tokenizer("Hello, I am", return_tensors="pt")
tokens = model.generate(**inputs)
print(tokenizer.decode(tokens[0]))
```

Revision/branch `global_step11921` corresponds exactly to the model checkpoint on the `main` branch of each model. Specifying the revision allows you to load intermediate checkpoints. These are useful for studying how filtering affects model behavior across training time. Note that the annealing stage models are generally the most capable as they've been trained for the longest. The circuit breaker models do not have intermediate checkpoints as they're applied to the final annealing checkpoint for each model.
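
To enumerate the intermediate checkpoints programmatically, you can list a model repository's branches. The snippet below is a minimal sketch; it assumes checkpoint branches follow the `global_step<N>` naming seen above (`global_step11921`), so verify the branch names on the Hub for the model you care about.

```python
from huggingface_hub import list_repo_refs
from transformers import GPTNeoXForCausalLM

model_name = "EleutherAI/deep-ignorance-strong-filter-pt-weak-filter-anneal"

# List the repo's branches and keep those that look like training checkpoints.
# The `global_step<N>` pattern is an assumption based on the example revision above.
refs = list_repo_refs(model_name)
checkpoints = sorted(
    (branch.name for branch in refs.branches if branch.name.startswith("global_step")),
    key=lambda name: int(name.removeprefix("global_step")),
)
print(checkpoints)

# Load the earliest available checkpoint, e.g. to compare against the final model.
early_model = GPTNeoXForCausalLM.from_pretrained(model_name, revision=checkpoints[0])
```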

To ensure our filtering approach preserves beneficial knowledge, we evaluate on general-capability benchmarks, including:

- **LAMBADA**: Text comprehension requiring full-context understanding
- **HellaSwag**: Commonsense natural language inference

| Model | Pretraining Filtering | Annealing Filtering | WMDP Bio Average (Robust MCQA, Verified Cloze) (↓) | Average (MMLU, PIQA, Lambada, HellaSwag) (↑) | WMDP Bio Robust MCQA (↓) | WMDP Bio Verified Cloze (↓) | MMLU (↑) | PIQA (↑) | Lambada (↑) | HellaSwag (↑) |
|:------|:------------------------|:----------------------|:-----------------------------------------------------|:-----------------------------------------------|:---------------------------|:------------------------------|:---------------|:---------------|:---------------|:----------------|
| deep-ignorance-unfiltered | - | - | 39.66% | 56.05% | 42.97% | 36.34% | 44.92% | 76.44% | 47.08% | 55.75% |
| deep-ignorance-pretraining-stage-unfiltered | - | - | 37.16% (-2.50) | 60.24% (4.19) | 38.25% (-4.72) | 36.06% (-0.28) | 42.80% (-2.12) | 79.05% (2.61) | 63.03% (15.95) | 56.06% (0.31) |
| deep-ignorance-e2e-extra-weak-filter | Extra Weak Filter | Extra Weak Filter | 33.70% (-5.96) | 55.83% (-0.22) | 38.02% (-4.95) | 29.37% (-6.97) | 44.13% (-0.79) | 77.04% (0.60) | 46.85% (-0.23) | 55.29% (-0.46) |
| deep-ignorance-weak-filter-pt-strong-filter-anneal | Weak Filter | Strong Filter | 30.97% (-8.69) | 56.22% (0.17) | 36.75% (-6.22) | 25.19% (-11.15) | 43.16% (-1.76) | 77.20% (0.76) | 48.86% (1.78) | 55.67% (-0.08) |
| deep-ignorance-e2e-weak-filter | Weak Filter | Weak Filter | 30.50% (-9.16) | 57.37% (1.32) | 35.25% (-7.72) | 25.74% (-10.60) | 43.91% (-1.01) | 78.35% (1.91) | 51.81% (4.73) | 55.41% (-0.34) |
| deep-ignorance-strong-filter-pt-weak-filter-anneal | Strong Filter | Weak Filter | 30.38% (-9.28) | 57.88% (1.83) | 33.99% (-8.98) | 26.77% (-9.57) | 44.82% (-0.10) | 76.88% (0.44) | 54.05% (6.97) | 55.78% (0.03) |
| deep-ignorance-e2e-strong-filter | Strong Filter | Strong Filter | 29.90% (-9.76) | 55.53% (-0.52) | 35.37% (-7.60) | 24.44% (-11.90) | 43.21% (-1.71) | 75.73% (-0.71) | 47.29% (0.21) | 55.90% (0.15) |
| deep-ignorance-pretraining-stage-strong-filter | Strong Filter | - | 29.47% (-10.19) | 60.02% (3.97) | 33.29% (-9.68) | 25.65% (-10.69) | 43.46% (-1.46) | 79.27% (2.83) | 60.82% (13.74) | 56.53% (0.78) |
| deep-ignorance-unfiltered-cb | - | - | 29.29% (-10.37) | 54.11% (-1.94) | 29.49% (-13.48) | 29.09% (-7.25) | 43.61% (-1.31) | 76.50% (0.06) | 45.84% (-1.24) | 50.50% (-5.25) |
| deep-ignorance-pretraining-stage-weak-filter | Weak Filter | - | 29.12% (-10.54) | 58.98% (2.93) | 33.53% (-9.44) | 24.72% (-11.62) | 41.04% (-3.88) | 78.78% (2.34) | 60.57% (13.49) | 55.53% (-0.22) |
| deep-ignorance-strong-filter-pt-weak-filter-anneal-cb-lat | Strong Filter | Weak Filter | 26.92% (-12.74) | 58.00% (1.95) | 29.95% (-13.02) | 23.88% (-12.46) | 43.52% (-1.40) | 76.61% (0.17) | 56.01% (8.93) | 55.84% (0.09) |
| deep-ignorance-strong-filter-pt-weak-filter-anneal-cb | Strong Filter | Weak Filter | 26.12% (-13.54) | 56.46% (0.41) | 25.46% (-17.51) | 26.77% (-9.57) | 41.45% (-3.47) | 76.33% (-0.11) | 53.64% (6.56) | 54.40% (-1.35) |
| deep-ignorance-unfiltered-cb-lat | - | - | 25.93% (-13.73) | 56.43% (0.38) | 27.42% (-15.55) | 24.44% (-11.90) | 42.73% (-2.19) | 76.22% (-0.22) | 51.85% (4.77) | 54.92% (-0.83) |
| deep-ignorance-e2e-strong-filter-cb-lat | Strong Filter | Strong Filter | 25.87% (-13.79) | 56.60% (0.55) | 27.76% (-15.21) | 23.98% (-12.36) | 42.08% (-2.84) | 75.41% (-1.03) | 52.75% (5.67) | 56.18% (0.43) |
| deep-ignorance-e2e-strong-filter-cb | Strong Filter | Strong Filter | 25.56% (-14.10) | 52.60% (-3.45) | 25.00% (-17.97) | 26.12% (-10.22) | 39.45% (-5.47) | 75.35% (-1.09) | 47.56% (0.48) | 48.03% (-7.72) |
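
The WMDP Bio Robust MCQA and Verified Cloze evaluations are custom to the paper and implemented in the codebase linked above. As a rough approximation of the general-capability columns only, you could run the public lm-evaluation-harness tasks; the task names and settings below are assumptions rather than the paper's exact configuration.

```python
# Requires: pip install lm-eval
import lm_eval

# Approximate the MMLU / PIQA / Lambada / HellaSwag columns with public harness tasks.
# This is a sketch, not the paper's evaluation pipeline.
results = lm_eval.simple_evaluate(
    model="hf",
    model_args="pretrained=EleutherAI/deep-ignorance-e2e-strong-filter,revision=global_step11921",
    tasks=["mmlu", "piqa", "lambada_openai", "hellaswag"],
    batch_size=8,
)

for task, metrics in results["results"].items():
    print(task, metrics)
```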
# Acknowledgments