---
library_name: transformers
datasets:
- WebOrganizer/TopicAnnotations-Llama-3.1-8B
- WebOrganizer/TopicAnnotations-Llama-3.1-405B-FP8
base_model:
- answerdotai/ModernBERT-base
---
# wissamantoun/WebOrganizer-TopicClassifier-ModernBERT

[[Paper](https://arxiv.org/abs/2502.10341)] [[Website](https://weborganizer.allenai.org)] [[GitHub](https://github.com/CodeCreator/WebOrganizer)]

*All credit goes to the original authors of the model and dataset. This is a retraining of the original model with a different base model.*

The TopicClassifier organizes web content into 24 categories based on the URL and text contents of web pages.
The model is a [ModernBERT-base](https://huggingface.co/answerdotai/ModernBERT-base) with 149M parameters fine-tuned on the following training data:

1. [WebOrganizer/TopicAnnotations-Llama-3.1-8B](https://huggingface.co/datasets/WebOrganizer/TopicAnnotations-Llama-3.1-8B): 1M documents annotated by Llama-3.1-8B (first-stage training)
2. [WebOrganizer/TopicAnnotations-Llama-3.1-405B-FP8](https://huggingface.co/datasets/WebOrganizer/TopicAnnotations-Llama-3.1-405B-FP8): 100K documents annotated by Llama-3.1-405B-FP8 (second-stage training)

#### All Domain Classifiers
- [wissamantoun/WebOrganizer-FormatClassifier-ModernBERT](https://huggingface.co/wissamantoun/WebOrganizer-FormatClassifier-ModernBERT)
- [wissamantoun/WebOrganizer-TopicClassifier-ModernBERT](https://huggingface.co/wissamantoun/WebOrganizer-TopicClassifier-ModernBERT) *← you are here!*

## Usage

This classifier expects input in the following format (the URL, a blank line, then the page text):
```
{url}

{text}
```
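
As a minimal sketch, the format above can be produced with a small helper. The name `format_input` and the whitespace stripping are assumptions made for illustration; they are not part of the model or of `transformers`.

```python
def format_input(url: str, text: str) -> str:
    """Join a page URL and its text with one blank line between them."""
    # Stripping surrounding whitespace is an assumption of this sketch.
    return f"{url.strip()}\n\n{text.strip()}"


# Hypothetical pages, used only to illustrate the helper.
pages = [
    ("http://www.example.com", "How to build a computer from scratch?"),
    ("http://www.example.org/recipes", "A simple recipe for sourdough bread."),
]

# One formatted string per page, ready to pass to the tokenizer as a batch.
batch = [format_input(url, text) for url, text in pages]
print(batch[0])
```

The resulting list can be passed to the tokenizer in place of a single hand-written string.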

Example:
```python
from transformers import AutoTokenizer, AutoModelForSequenceClassification

tokenizer = AutoTokenizer.from_pretrained("wissamantoun/WebOrganizer-TopicClassifier-ModernBERT")
model = AutoModelForSequenceClassification.from_pretrained(
    "wissamantoun/WebOrganizer-TopicClassifier-ModernBERT",
    trust_remote_code=True,
    use_memory_efficient_attention=False)

web_page = """http://www.example.com

How to build a computer from scratch? Here are the components you need..."""

inputs = tokenizer([web_page], return_tensors="pt")
outputs = model(**inputs)

probs = outputs.logits.softmax(dim=-1)
print(probs.argmax(dim=-1))
# -> 5 ("Hardware" topic)
```

You can convert the `logits` of the model with a softmax to obtain a probability distribution over the following 24 categories. They are listed in label order; label ids start at 0, so label id `i` corresponds to item `i + 1` below (also see `id2label` and `label2id` in the model config):
1. Adult
2. Art & Design
3. Software Dev.
4. Crime & Law
5. Education & Jobs
6. Hardware
7. Entertainment
8. Social Life
9. Fashion & Beauty
10. Finance & Business
11. Food & Dining
12. Games
13. Health
14. History
15. Home & Hobbies
16. Industrial
17. Literature
18. Politics
19. Religion
20. Science & Tech.
21. Software
22. Sports & Fitness
23. Transportation
24. Travel

The full definitions of the categories can be found in the [taxonomy config](https://github.com/CodeCreator/WebOrganizer/blob/main/define_domains/taxonomies/topics.yaml).
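
The mapping from logits to topic names can be sketched in plain Python as follows. The `TOPICS` list is copied from the list above, and the helper names `softmax` and `top_topics` and the example logits are invented for illustration. In practice the `id2label` mapping in the model config is the authoritative source.

```python
import math

# Topic names copied from the list above; index i is label id i.
TOPICS = [
    "Adult", "Art & Design", "Software Dev.", "Crime & Law", "Education & Jobs",
    "Hardware", "Entertainment", "Social Life", "Fashion & Beauty",
    "Finance & Business", "Food & Dining", "Games", "Health", "History",
    "Home & Hobbies", "Industrial", "Literature", "Politics", "Religion",
    "Science & Tech.", "Software", "Sports & Fitness", "Transportation", "Travel",
]


def softmax(logits):
    """Numerically stable softmax over a list of floats."""
    peak = max(logits)
    exps = [math.exp(x - peak) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]


def top_topics(logits, k=3):
    """Return the k most probable (topic name, probability) pairs."""
    probs = softmax(logits)
    ranked = sorted(range(len(probs)), key=lambda i: probs[i], reverse=True)
    return [(TOPICS[i], probs[i]) for i in ranked[:k]]


# Made-up logits that favor label id 5, then label id 20.
example_logits = [0.0] * len(TOPICS)
example_logits[5] = 4.0
example_logits[20] = 2.0

for name, prob in top_topics(example_logits):
    print(f"{name}: {prob:.3f}")
```

With real model output, `outputs.logits[0].tolist()` could be passed in place of `example_logits`.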

## Scores
```
***** pred metrics *****
  test_accuracy                      =     0.8585
  test_accuracy__0                   =     0.9346
  test_accuracy__1                   =     0.7317
  test_accuracy__10                  =     0.9148
  test_accuracy__11                  =     0.8927
  test_accuracy__12                  =     0.8687
  test_accuracy__13                  =      0.814
  test_accuracy__14                  =     0.8616
  test_accuracy__15                  =     0.7179
  test_accuracy__16                  =      0.855
  test_accuracy__17                  =     0.8246
  test_accuracy__18                  =      0.907
  test_accuracy__19                  =     0.8333
  test_accuracy__2                   =      0.866
  test_accuracy__20                  =     0.8294
  test_accuracy__21                  =     0.9441
  test_accuracy__22                  =     0.8788
  test_accuracy__23                  =        0.9
  test_accuracy__3                   =      0.847
  test_accuracy__4                   =     0.8442
  test_accuracy__5                   =     0.8189
  test_accuracy__6                   =     0.8997
  test_accuracy__7                   =     0.7295
  test_accuracy__8                   =     0.8937
  test_accuracy__9                   =     0.8665
  test_accuracy_conf50               =     0.8674
  test_accuracy_conf50__0            =     0.9434
  test_accuracy_conf50__1            =     0.7453
  test_accuracy_conf50__10           =       0.93
  test_accuracy_conf50__11           =     0.8958
  test_accuracy_conf50__12           =     0.8768
  test_accuracy_conf50__13           =     0.8193
  test_accuracy_conf50__14           =     0.8691
  test_accuracy_conf50__15           =     0.7237
  test_accuracy_conf50__16           =      0.864
  test_accuracy_conf50__17           =     0.8358
  test_accuracy_conf50__18           =       0.91
  test_accuracy_conf50__19           =     0.8481
  test_accuracy_conf50__2            =     0.8768
  test_accuracy_conf50__20           =     0.8434
  test_accuracy_conf50__21           =     0.9505
  test_accuracy_conf50__22           =     0.8844
  test_accuracy_conf50__23           =     0.9028
  test_accuracy_conf50__3            =     0.8571
  test_accuracy_conf50__4            =      0.851
  test_accuracy_conf50__5            =     0.8206
  test_accuracy_conf50__6            =     0.9071
  test_accuracy_conf50__7            =     0.7442
  test_accuracy_conf50__8            =     0.9006
  test_accuracy_conf50__9            =     0.8761
  test_accuracy_conf75               =     0.9178 <--- Metric from the paper
  test_accuracy_conf75__0            =       0.95
  test_accuracy_conf75__1            =     0.8413
  test_accuracy_conf75__10           =     0.9556
  test_accuracy_conf75__11           =     0.9298
  test_accuracy_conf75__12           =     0.9299
  test_accuracy_conf75__13           =     0.8788
  test_accuracy_conf75__14           =     0.9126
  test_accuracy_conf75__15           =     0.8253
  test_accuracy_conf75__16           =     0.8885
  test_accuracy_conf75__17           =     0.8968
  test_accuracy_conf75__18           =      0.938
  test_accuracy_conf75__19           =     0.9113
  test_accuracy_conf75__2            =     0.9029
  test_accuracy_conf75__20           =     0.8966
  test_accuracy_conf75__21           =      0.968
  test_accuracy_conf75__22           =     0.9225
  test_accuracy_conf75__23           =     0.9444
  test_accuracy_conf75__3            =     0.9319
  test_accuracy_conf75__4            =     0.8976
  test_accuracy_conf75__5            =     0.9167
  test_accuracy_conf75__6            =     0.9483
  test_accuracy_conf75__7            =      0.804
  test_accuracy_conf75__8            =     0.9448
  test_accuracy_conf75__9            =      0.932
  test_accuracy_label_average        =     0.8531
  test_accuracy_label_average_conf50 =     0.8615
  test_accuracy_label_average_conf75 =     0.9111
  test_accuracy_label_min            =     0.7179
  test_accuracy_label_min_conf50     =     0.7237
  test_accuracy_label_min_conf75     =      0.804 <--- Metric from the paper
  test_loss                          =     0.4694
  test_proportion_conf50             =     0.9811
  test_proportion_conf75             =     0.8535
  test_runtime                       = 0:00:08.39
  test_samples_per_second            =   1191.144
  test_steps_per_second              =     37.283
```
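
The card does not define the `conf50` and `conf75` variants. One plausible reading, stated here as an assumption rather than as the authors' evaluation code, is that they restrict the evaluation to examples whose reference annotation has a confidence above 0.50 or 0.75, with `test_proportion_conf*` giving the fraction of examples kept. Under that assumption, the summary metrics could be computed roughly as in the sketch below; the function name and the toy data are made up for illustration.

```python
from collections import defaultdict


def thresholded_metrics(preds, targets, confidences, threshold=0.0):
    """Accuracy summaries over examples whose confidence exceeds `threshold`.

    Assumption: `confidences` holds one confidence value per example
    (for instance, the probability the annotator gave its top label).
    """
    kept = [(p, t) for p, t, c in zip(preds, targets, confidences) if c > threshold]
    if not kept:
        raise ValueError("no examples pass the confidence threshold")

    correct_by_label = defaultdict(int)
    total_by_label = defaultdict(int)
    for pred, target in kept:
        total_by_label[target] += 1
        correct_by_label[target] += int(pred == target)

    per_label = {
        label: correct_by_label[label] / total_by_label[label]
        for label in total_by_label
    }
    return {
        "accuracy": sum(correct_by_label.values()) / len(kept),
        "label_average": sum(per_label.values()) / len(per_label),
        "label_min": min(per_label.values()),
        "proportion": len(kept) / len(preds),
    }


# Made-up toy data: predicted ids, reference ids, reference confidences.
preds = [0, 0, 1, 2, 2, 2]
targets = [0, 1, 1, 1, 2, 0]
confidences = [0.9, 0.6, 0.8, 0.95, 0.4, 0.7]

print(thresholded_metrics(preds, targets, confidences, threshold=0.75))
```

In the reported scores, `test_accuracy_label_min` matches the lowest per-label accuracy (label 15 for the unfiltered metric, label 7 for `conf75`), which is consistent with the `label_min` computation sketched here.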



## Citation
```bibtex
@article{wettig2025organize,
  title={Organize the Web: Constructing Domains Enhances Pre-Training Data Curation},
  author={Alexander Wettig and Kyle Lo and Sewon Min and Hannaneh Hajishirzi and Danqi Chen and Luca Soldaini},
  journal={arXiv preprint arXiv:2502.10341},
  year={2025}
}
```