Update README.md (#2)
- Update README.md (6d38ed3f2975c447578067142b8e269f5b37c350)
Co-authored-by: David Brandfonbrener <[email protected]>
README.md
CHANGED
@@ -2,6 +2,9 @@
 
 See accompanying code at: https://github.com/davidbrandfonbrener/color-filter-olmo
 
+If you only want to download the filtered, untokenized data, see: https://huggingface.co/datasets/davidbrandfonbrener/color-filtered-c4
+
+## Usage
 
 To download the data, we recommend using the huggingface-cli.
 
@@ -18,10 +21,10 @@ If you only want to download some files (e.g. just the models), use the cli. For
 If you use this code in your research, please cite the following paper:
 
 ```bibtex
-@
-title={},
-author={},
-
-year={}
+@article{brandfonbrener2024color,
+title={CoLoR-Filter: Conditional Loss Reduction Filtering for Targeted Language Model Pre-training},
+author={Brandfonbrener, David and Zhang, Hanlin and Kirsch, Andreas and Schwarz, Jonathan Richard and Kakade, Sham M},
+journal={arXiv preprint arXiv:XXXX.XXXXX},
+year={2024}
 }
 ```
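The README text in this diff recommends the huggingface-cli for downloads but the diff itself shows no command. As a rough sketch only, the helper below assembles a `huggingface-cli download` invocation for the filtered dataset linked in the added line. The repo id is taken from that URL; the helper name `build_download_command`, the `*.json` glob, and the target directory are hypothetical and would need to be adjusted to the files the repo actually contains.

```python
import shlex


def build_download_command(repo_id, repo_type="dataset", include=None, local_dir=None):
    """Assemble a `huggingface-cli download` invocation as an argument list.

    Hypothetical helper for illustration; it only builds the command and
    does not contact the Hub.
    """
    cmd = ["huggingface-cli", "download", repo_id, "--repo-type", repo_type]
    if include:
        # --include takes one or more glob patterns selecting files to fetch
        cmd += ["--include", *include]
    if local_dir:
        # --local-dir places the files in a plain directory instead of the cache
        cmd += ["--local-dir", local_dir]
    return cmd


# Repo id from the dataset URL in the README; glob and directory are assumptions.
cmd = build_download_command(
    "davidbrandfonbrener/color-filtered-c4",
    include=["*.json"],
    local_dir="color-filtered-c4",
)

# Print a shell-safe version of the command for copy-pasting.
print(shlex.join(cmd))
```

The printed line can be pasted into a shell, or the list can be passed to `subprocess.run(cmd, check=True)` once the `huggingface_hub` CLI is installed.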