---
library_name: transformers
datasets:
- WebOrganizer/TopicAnnotations-Llama-3.1-8B
- WebOrganizer/TopicAnnotations-Llama-3.1-405B-FP8
base_model:
- answerdotai/ModernBERT-base
---
# wissamantoun/WebOrganizer-TopicClassifier-ModernBERT
[[Paper](https://arxiv.org/abs/2502.10341)] [[Website](https://weborganizer.allenai.org)] [[GitHub](https://github.com/CodeCreator/WebOrganizer)]
*All credit goes to the original authors of the model and dataset. This is a retraining of the original model with a different base model.*
The TopicClassifier organizes web content into 24 categories based on the URL and text contents of web pages.
The model is a [ModernBERT-base](https://huggingface.co/answerdotai/ModernBERT-base) with 140M parameters fine-tuned on the following training data:
1. [WebOrganizer/TopicAnnotations-Llama-3.1-8B](https://huggingface.co/datasets/WebOrganizer/TopicAnnotations-Llama-3.1-8B): 1M documents annotated by Llama-3.1-8B (first-stage training)
2. [WebOrganizer/TopicAnnotations-Llama-3.1-405B-FP8](https://huggingface.co/datasets/WebOrganizer/TopicAnnotations-Llama-3.1-405B-FP8): 100K documents annotated by Llama-3.1-405B-FP8 (second-stage training)
#### All Domain Classifiers
- [wissamantoun/WebOrganizer-FormatClassifier-ModernBERT](https://huggingface.co/wissamantoun/WebOrganizer-FormatClassifier-ModernBERT)
- [wissamantoun/WebOrganizer-TopicClassifier-ModernBERT](https://huggingface.co/wissamantoun/WebOrganizer-TopicClassifier-ModernBERT) *← you are here!*
## Usage
This classifier expects the URL on the first line and the page text below it:
```
{url}
{text}
```
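The template can be assembled with a small helper. This is only a minimal sketch: `format_web_page` is a hypothetical name that is not part of the model or of `transformers`, and stripping surrounding whitespace is an assumption rather than a documented requirement.

```python
def format_web_page(url: str, text: str) -> str:
    # Hypothetical helper: put the URL on the first line and the page text below it.
    # Stripping whitespace is an assumption, not something the model card requires.
    return f"{url.strip()}\n{text.strip()}"


example = format_web_page(
    "http://www.example.com",
    "How to build a computer from scratch? Here are the components you need...",
)
print(example)
```

The resulting string can be passed to the tokenizer as a single document.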
Example:
```python
from transformers import AutoTokenizer, AutoModelForSequenceClassification

model_id = "wissamantoun/WebOrganizer-TopicClassifier-ModernBERT"
tokenizer = AutoTokenizer.from_pretrained(model_id)
model = AutoModelForSequenceClassification.from_pretrained(
    model_id,
    trust_remote_code=True,
    use_memory_efficient_attention=False)

# The URL goes on the first line, followed by the page text.
web_page = """http://www.example.com
How to build a computer from scratch? Here are the components you need..."""

inputs = tokenizer([web_page], return_tensors="pt")
outputs = model(**inputs)

# Softmax turns the logits into a probability distribution over the topics.
probs = outputs.logits.softmax(dim=-1)
print(probs.argmax(dim=-1))
# -> 5 ("Hardware" topic)
```
You can convert the `logits` of the model with a softmax to obtain a probability distribution over the following 24 categories. The list follows the label order, but label ids start at 0, so item 6 (Hardware) corresponds to label id 5; also see `id2label` and `label2id` in the model config:
1. Adult
2. Art & Design
3. Software Dev.
4. Crime & Law
5. Education & Jobs
6. Hardware
7. Entertainment
8. Social Life
9. Fashion & Beauty
10. Finance & Business
11. Food & Dining
12. Games
13. Health
14. History
15. Home & Hobbies
16. Industrial
17. Literature
18. Politics
19. Religion
20. Science & Tech.
21. Software
22. Sports & Fitness
23. Transportation
24. Travel
The full definitions of the categories can be found in the [taxonomy config](https://github.com/CodeCreator/WebOrganizer/blob/main/define_domains/taxonomies/topics.yaml).
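To turn logits into readable topic names, the category list can be paired with a softmax. The sketch below is illustrative only: `TOPIC_LABELS`, `softmax`, and `top_topics` are hypothetical names, the logits are made up, and the list is assumed to match the `id2label` mapping in the model config, which remains the authoritative source.

```python
import math

# Category names in label order, copied from the list above (label ids start at 0).
# Assumption: this order matches `id2label` in the model config.
TOPIC_LABELS = [
    "Adult", "Art & Design", "Software Dev.", "Crime & Law",
    "Education & Jobs", "Hardware", "Entertainment", "Social Life",
    "Fashion & Beauty", "Finance & Business", "Food & Dining", "Games",
    "Health", "History", "Home & Hobbies", "Industrial",
    "Literature", "Politics", "Religion", "Science & Tech.",
    "Software", "Sports & Fitness", "Transportation", "Travel",
]


def softmax(logits):
    # Subtract the maximum before exponentiating for numerical stability.
    shift = max(logits)
    exps = [math.exp(x - shift) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]


def top_topics(logits, k=3):
    # Return the k most probable (topic name, probability) pairs.
    probs = softmax(logits)
    ranked = sorted(range(len(probs)), key=lambda i: probs[i], reverse=True)
    return [(TOPIC_LABELS[i], probs[i]) for i in ranked[:k]]


# Made-up logits for illustration: label 5 scores highest, label 20 second.
fake_logits = [0.0] * len(TOPIC_LABELS)
fake_logits[5] = 4.0
fake_logits[20] = 2.0
print(top_topics(fake_logits, k=2))
```

With real model output, `outputs.logits[0].tolist()` could stand in for `fake_logits`.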
## Scores
```
***** pred metrics *****
test_accuracy = 0.8585
test_accuracy__0 = 0.9346
test_accuracy__1 = 0.7317
test_accuracy__10 = 0.9148
test_accuracy__11 = 0.8927
test_accuracy__12 = 0.8687
test_accuracy__13 = 0.814
test_accuracy__14 = 0.8616
test_accuracy__15 = 0.7179
test_accuracy__16 = 0.855
test_accuracy__17 = 0.8246
test_accuracy__18 = 0.907
test_accuracy__19 = 0.8333
test_accuracy__2 = 0.866
test_accuracy__20 = 0.8294
test_accuracy__21 = 0.9441
test_accuracy__22 = 0.8788
test_accuracy__23 = 0.9
test_accuracy__3 = 0.847
test_accuracy__4 = 0.8442
test_accuracy__5 = 0.8189
test_accuracy__6 = 0.8997
test_accuracy__7 = 0.7295
test_accuracy__8 = 0.8937
test_accuracy__9 = 0.8665
test_accuracy_conf50 = 0.8674
test_accuracy_conf50__0 = 0.9434
test_accuracy_conf50__1 = 0.7453
test_accuracy_conf50__10 = 0.93
test_accuracy_conf50__11 = 0.8958
test_accuracy_conf50__12 = 0.8768
test_accuracy_conf50__13 = 0.8193
test_accuracy_conf50__14 = 0.8691
test_accuracy_conf50__15 = 0.7237
test_accuracy_conf50__16 = 0.864
test_accuracy_conf50__17 = 0.8358
test_accuracy_conf50__18 = 0.91
test_accuracy_conf50__19 = 0.8481
test_accuracy_conf50__2 = 0.8768
test_accuracy_conf50__20 = 0.8434
test_accuracy_conf50__21 = 0.9505
test_accuracy_conf50__22 = 0.8844
test_accuracy_conf50__23 = 0.9028
test_accuracy_conf50__3 = 0.8571
test_accuracy_conf50__4 = 0.851
test_accuracy_conf50__5 = 0.8206
test_accuracy_conf50__6 = 0.9071
test_accuracy_conf50__7 = 0.7442
test_accuracy_conf50__8 = 0.9006
test_accuracy_conf50__9 = 0.8761
test_accuracy_conf75 = 0.9178 <--- Metric from the paper
test_accuracy_conf75__0 = 0.95
test_accuracy_conf75__1 = 0.8413
test_accuracy_conf75__10 = 0.9556
test_accuracy_conf75__11 = 0.9298
test_accuracy_conf75__12 = 0.9299
test_accuracy_conf75__13 = 0.8788
test_accuracy_conf75__14 = 0.9126
test_accuracy_conf75__15 = 0.8253
test_accuracy_conf75__16 = 0.8885
test_accuracy_conf75__17 = 0.8968
test_accuracy_conf75__18 = 0.938
test_accuracy_conf75__19 = 0.9113
test_accuracy_conf75__2 = 0.9029
test_accuracy_conf75__20 = 0.8966
test_accuracy_conf75__21 = 0.968
test_accuracy_conf75__22 = 0.9225
test_accuracy_conf75__23 = 0.9444
test_accuracy_conf75__3 = 0.9319
test_accuracy_conf75__4 = 0.8976
test_accuracy_conf75__5 = 0.9167
test_accuracy_conf75__6 = 0.9483
test_accuracy_conf75__7 = 0.804
test_accuracy_conf75__8 = 0.9448
test_accuracy_conf75__9 = 0.932
test_accuracy_label_average = 0.8531
test_accuracy_label_average_conf50 = 0.8615
test_accuracy_label_average_conf75 = 0.9111
test_accuracy_label_min = 0.7179
test_accuracy_label_min_conf50 = 0.7237
test_accuracy_label_min_conf75 = 0.804 <--- Metric from the paper
test_loss = 0.4694
test_proportion_conf50 = 0.9811
test_proportion_conf75 = 0.8535
test_runtime = 0:00:08.39
test_samples_per_second = 1191.144
test_steps_per_second = 37.283
```
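The summary rows can be cross-checked against the per-label rows. The sketch below assumes that `test_accuracy_label_average` is the unweighted mean of the 24 per-label accuracies and that `test_accuracy_label_min` is their minimum; this reading is inferred from the numbers, not stated in the log. The values are copied from the log and re-sorted numerically by label id, since the log lists them in lexicographic order.

```python
# Per-label test accuracies (test_accuracy__0 ... test_accuracy__23) from the log above,
# re-sorted by numeric label id.
per_label_accuracy = [
    0.9346, 0.7317, 0.866, 0.847, 0.8442, 0.8189,
    0.8997, 0.7295, 0.8937, 0.8665, 0.9148, 0.8927,
    0.8687, 0.814, 0.8616, 0.7179, 0.855, 0.8246,
    0.907, 0.8333, 0.8294, 0.9441, 0.8788, 0.9,
]

# Assumed definitions: unweighted mean and minimum over labels.
label_average = sum(per_label_accuracy) / len(per_label_accuracy)
label_min = min(per_label_accuracy)
worst_label = per_label_accuracy.index(label_min)

print(f"label_average = {label_average:.4f}")
print(f"label_min = {label_min:.4f} (label id {worst_label})")
```

Under these assumptions the results agree with the reported `test_accuracy_label_average = 0.8531` and `test_accuracy_label_min = 0.7179`, and the weakest label is id 15, which is "Industrial" in the category list.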
## Citation
```bibtex
@article{wettig2025organize,
title={Organize the Web: Constructing Domains Enhances Pre-Training Data Curation},
author={Alexander Wettig and Kyle Lo and Sewon Min and Hannaneh Hajishirzi and Danqi Chen and Luca Soldaini},
journal={arXiv preprint arXiv:2502.10341},
year={2025}
}
```