dtc111 polaritus committed on
Commit 091f734 · verified · 1 Parent(s): 876ff75

Update README.md (#3)

- Update README.md (bf2b73f21021e34a342e78c2250e5f462369af48)


Co-authored-by: 李梓鸣 <[email protected]>

Files changed (1)
  1. README.md +343 -3
README.md CHANGED
@@ -1,3 +1,343 @@
- ---
- license: mit
- ---
---
license: mit
datasets:
- Somayeh-h/Nordland
- OPR-Project/OxfordRobotCar_OpenPlaceRecognition
language:
- en
metrics:
- recall_at_1
- recall_at_5
pipeline_tag: image-feature-extraction
tags:
- place-recognition
- visual-place-recognition
- computer-vision
- transformer
- 3d-vision
library:
- pytorch
- lightning
---

# Model Card for UniPR-3D

UniPR-3D is a universal visual place recognition (VPR) framework that supports both **single-frame** and **sequence-to-sequence** matching. It leverages **3D visual geometry grounded tokens** within a transformer architecture to produce robust, viewpoint-invariant descriptors for long-term place recognition under challenging environmental variations (e.g., seasonal, weather, lighting, and viewpoint changes).

## Model Details

### Model Description

- **Developed by:** Tianchen Deng, Xun Chen, Ziming Li, Hongming Shen, Danwei Wang, Javier Civera, Hesheng Wang
- **Shared by:** Tianchen Deng
- **Model type:** Vision Transformer with 3D-aware token aggregation for visual place recognition
- **Language(s):** English (dataset metadata); the model itself is vision-only
- **License:** MIT

### Model Sources

- **Repository:** [dtc111111/UniPR-3D](https://github.com/dtc111111/UniPR-3D)
- **Paper:** [UniPR-3D: Towards Universal Visual Place Recognition with 3D Visual Geometry Grounded Transformer](https://arxiv.org/abs/2512.21078) (arXiv:2512.21078, 2025)
- **Demo:** No demo available

## Uses

### Direct Use

This model can be used **out of the box** to extract compact, discriminative global descriptors from:
- Single RGB images (for frame-to-frame VPR)
- Sequences of images (for sequence-to-sequence VPR)

These descriptors are suitable for large-scale localization, robot navigation, and SLAM systems that require robustness to appearance changes.
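As a concrete illustration of how such global descriptors are used, the sketch below performs cosine-similarity retrieval of the top-K database places for a batch of queries. It is a minimal example, not code from the repository: the descriptor tensors are random placeholders, and only the 17152-dimensional descriptor size is taken from this card.

```python
import torch
import torch.nn.functional as F

# Placeholder descriptors; in practice these come from the UniPR-3D model.
db_desc = torch.randn(2_000, 17152)      # reference-map (database) descriptors
query_desc = torch.randn(8, 17152)       # query descriptors

# Cosine similarity = dot product of L2-normalized descriptors.
db_desc = F.normalize(db_desc, dim=1)
query_desc = F.normalize(query_desc, dim=1)
sim = query_desc @ db_desc.t()           # (num_queries, num_db)

# Top-K candidate places per query (K = 5, matching the R@5 reporting below).
topk_sim, topk_idx = sim.topk(k=5, dim=1)
```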

### Downstream Use

- Integration into **visual SLAM** or **long-term autonomous navigation** pipelines
- Replacement for traditional VPR backbones (e.g., NetVLAD, MixVPR, EigenPlaces)
- Fine-tuning on domain-specific datasets (e.g., underground, aerial, or underwater environments)

### Out-of-Scope Use

- **Not intended** for real-time inference on low-power embedded devices without optimization (latency ~8.23 ms per image on an RTX 4090)
- **Not designed** for non-visual modalities (e.g., LiDAR, audio, text)
- Performance may degrade under **extreme occlusion**, in **textureless scenes**, or in **indoor environments not seen during training**

## Bias, Risks, and Limitations

- Trained primarily on **urban street-level imagery** (GSV-Cities, Mapillary MSLS), so generalization to rural, indoor, or non-Western cities may be limited
- Inherits biases from the training data (e.g., geographic overrepresentation of North America and Europe)
- No explicit fairness or demographic considerations (it is a geometric vision model)

### Recommendations

- Evaluate on the target domain before deployment
- Monitor recall performance on your specific dataset using standard VPR metrics (R@1, R@5)

## How to Get Started with the Model

Inference and evaluation scripts are provided in the GitHub repository (`eval_lora.py`, `main_ft.py`). Pretrained weights are available on Hugging Face or via the repository releases.
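Those scripts are the authoritative entry point. Purely as a hypothetical sketch of the surrounding plumbing for single-image descriptor extraction (the model construction is left to the repository scripts because this card does not document the loading API, and the ImageNet normalization constants are an assumption, standard for DINOv2-based backbones), the flow is roughly:

```python
import torch
import torch.nn.functional as F
from PIL import Image
from torchvision import transforms

# Preprocessing per the card: resize to 518x518 (normalization constants assumed).
preprocess = transforms.Compose([
    transforms.Resize((518, 518)),
    transforms.ToTensor(),
    transforms.Normalize(mean=[0.485, 0.456, 0.406], std=[0.229, 0.224, 0.225]),
])

@torch.no_grad()
def extract_descriptor(model, image_path):
    """Return an L2-normalized global descriptor for one image.

    `model` is the UniPR-3D network built and loaded as in eval_lora.py /
    main_ft.py; this helper only illustrates the surrounding steps.
    """
    image = Image.open(image_path).convert("RGB")
    batch = preprocess(image).unsqueeze(0)      # (1, 3, 518, 518)
    descriptor = model(batch)                   # (1, 17152) global descriptor
    return F.normalize(descriptor, dim=1)
```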

## Training Details

### Training Data

- **Single-frame model**: trained on [GSV-Cities](https://github.com/amaralibey/gsv-cities)
- **Multi-frame model**: trained on [Mapillary Street-Level Sequences (MSLS)](https://www.mapillary.com/dataset/places)
- Both datasets contain millions of geo-tagged urban street-view images spanning diverse cities, seasons, and conditions.

### Training Procedure

#### Preprocessing
- Images resized to 518×518
- Sequences sampled by spatial proximity for multi-frame training

#### Training Hyperparameters
- **Backbone**: DINOv2 (ViT-Large)
- **Optimization**: AdamW with learning-rate scheduling
- **Loss**: multi-similarity loss with pair weighting
- **Training regime**: mixed-precision (fp16) on NVIDIA GPUs
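For readers unfamiliar with it, the sketch below shows the core of a multi-similarity loss (Wang et al., 2019) over a batch of L2-normalized place descriptors. It is illustrative only: the hard-pair mining step and the hyperparameter values actually used to train UniPR-3D are not reproduced here.

```python
import torch
import torch.nn.functional as F

def multi_similarity_loss(descriptors, labels, alpha=2.0, beta=50.0, lam=0.5):
    """Illustrative multi-similarity loss over a batch; pair mining omitted."""
    desc = F.normalize(descriptors, dim=1)
    sim = desc @ desc.t()                              # cosine similarity matrix
    same = labels.unsqueeze(0) == labels.unsqueeze(1)
    eye = torch.eye(len(labels), dtype=torch.bool, device=desc.device)
    pos_mask = same & ~eye                             # same place, different image
    neg_mask = ~same                                   # different places

    # Soft log-sum-exp weighting of positive and negative pairs per anchor.
    pos_term = torch.log1p((torch.exp(-alpha * (sim - lam)) * pos_mask).sum(dim=1)) / alpha
    neg_term = torch.log1p((torch.exp(beta * (sim - lam)) * neg_mask).sum(dim=1)) / beta
    return (pos_term + neg_term).mean()
```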

#### Speeds, Sizes, Times
- **Inference latency**: single frame, 8.23 ms per image (RTX 4090)
- **Descriptor dimension**: 17152 (for UniPR-3D)
- **Training time**: not disclosed (multi-day runs on a multi-GPU setup)
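To sanity-check a latency figure like the 8.23 ms above on your own hardware, a simple CUDA-event timing loop is usually enough. The sketch below is generic (any descriptor extractor can be passed as `model`); the warm-up and iteration counts are arbitrary choices, not the paper's protocol.

```python
import torch

@torch.no_grad()
def measure_latency(model, image_size=518, warmup=20, iters=100):
    """Rough per-image GPU latency in milliseconds (batch size 1)."""
    model.eval().cuda()
    x = torch.randn(1, 3, image_size, image_size, device="cuda")
    for _ in range(warmup):                      # warm up kernels / autotuning
        model(x)
    start = torch.cuda.Event(enable_timing=True)
    end = torch.cuda.Event(enable_timing=True)
    torch.cuda.synchronize()
    start.record()
    for _ in range(iters):
        model(x)
    end.record()
    torch.cuda.synchronize()
    return start.elapsed_time(end) / iters       # milliseconds per image
```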

## Evaluation

### Testing Data, Factors & Metrics

#### Testing Data
- Single-frame evaluation:
  - The <a href="https://codalab.lisn.upsaclay.fr/competitions/865">MSLS Challenge</a>, where predictions are uploaded to the challenge server for evaluation
  - Single-frame <a href="https://www.mapillary.com/dataset/places">MSLS</a> validation set
  - Nordland, <a href="https://data.ciirc.cvut.cz/public/projects/2015netVLAD/Pittsburgh250k/">Pittsburgh</a>, and SPED datasets, which can be downloaded from <a href="https://surfdrive.surf.nl/index.php/s/sbZRXzYe3l0v67W">here</a> (evaluation setup aligned with DINOv2 SALAD)
- Multi-frame evaluation:
  - Multi-frame <a href="https://www.mapillary.com/dataset/places">MSLS</a> validation set
  - Two sequences from <a href="https://robotcar-dataset.robots.ox.ac.uk/datasets/">Oxford RobotCar</a>, which can be downloaded <a href="https://entuedu-my.sharepoint.com/personal/heshan001_e_ntu_edu_sg/_layouts/15/onedrive.aspx?id=%2Fpersonal%2Fheshan001%5Fe%5Fntu%5Fedu%5Fsg%2FDocuments%2Fcasevpr%5Fdatasets%2Foxford%5Frobotcar&viewid=e5dcb0e9%2Db23f%2D44cf%2Da843%2D7837d3064c2e&ga=1">here</a>:
    - 2014-12-16-18-44-24 (winter night) queries against the 2014-11-18-13-20-12 (fall day) database
    - 2014-11-14-16-34-33 (fall night) queries against the 2015-11-13-10-28-08 (fall day) database
  - <a href="https://github.com/gmberton/VPR-datasets-downloader/blob/main/download_nordland.py">Nordland (filtered) dataset</a>

#### Factors
- Seasonal variation (summer ↔ winter)
- Day vs. night
- Weather (sunny, rainy, snowy)
- Viewpoint change (lateral shift, orientation)

#### Metrics
- **Recall@K (R@1, R@5, R@10)**: the standard VPR metric, i.e., the fraction of queries whose correct match appears among the top-K retrieved database images
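To make the metric concrete, here is a small, self-contained Recall@K sketch given a query-to-database similarity matrix and a boolean ground-truth mask (the 25 m positive radius in the docstring is the common MSLS/Pittsburgh convention, mentioned only as an example and not specified by this card):

```python
import torch

def recall_at_k(sim, is_match, ks=(1, 5, 10)):
    """Recall@K for retrieval.

    `sim` is a (num_queries, num_db) similarity matrix and `is_match` a boolean
    matrix marking the correct database entries for each query (e.g., within a
    25 m radius). Returns {K: recall} averaged over all queries.
    """
    order = sim.argsort(dim=1, descending=True)      # best database match first
    hits = torch.gather(is_match, 1, order)          # reorder ground truth accordingly
    return {k: hits[:, :k].any(dim=1).float().mean().item() for k in ks}
```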

### Results

#### Summary

UniPR-3D achieves consistently higher recall than competing approaches, setting a new state of the art on both single-frame and sequence (multi-frame) benchmarks.

##### Single-frame matching results

<style>
table, th, td {
  border-collapse: collapse;
  text-align: center;
}
</style>
<table>
  <tr>
    <th colspan="2"></th>
    <th colspan="2">MSLS Challenge</th>
    <th colspan="2">MSLS Val</th>
    <th colspan="2">NordLand</th>
    <th colspan="2">Pitts250k-test</th>
    <th colspan="2">SPED</th>
  </tr>
  <tr>
    <th>Method</th>
    <th>Latency (ms)</th>
    <th>R@1</th>
    <th>R@5</th>
    <th>R@1</th>
    <th>R@5</th>
    <th>R@1</th>
    <th>R@5</th>
    <th>R@1</th>
    <th>R@5</th>
    <th>R@1</th>
    <th>R@5</th>
  </tr>
  <tr>
    <td>MixVPR</td>
    <td>1.37</td>
    <td>64.0</td>
    <td>75.9</td>
    <td>88.0</td>
    <td>92.7</td>
    <td>58.4</td>
    <td>74.6</td>
    <td>94.6</td>
    <td><u>98.3</u></td>
    <td>85.2</td>
    <td>92.1</td>
  </tr>
  <tr>
    <td>EigenPlaces</td>
    <td>2.65</td>
    <td>67.4</td>
    <td>77.1</td>
    <td>89.3</td>
    <td>93.7</td>
    <td>54.4</td>
    <td>68.8</td>
    <td>94.1</td>
    <td>98.0</td>
    <td>69.9</td>
    <td>82.9</td>
  </tr>
  <tr>
    <td>DINOv2 SALAD</td>
    <td>2.41</td>
    <td><u>73.0</u></td>
    <td><u>86.8</u></td>
    <td><u>91.2</u></td>
    <td><u>95.3</u></td>
    <td><u>69.6</u></td>
    <td><u>84.4</u></td>
    <td><u>94.5</u></td>
    <td><b>98.7</b></td>
    <td><u>89.5</u></td>
    <td><u>94.4</u></td>
  </tr>
  <tr>
    <td>UniPR-3D (ours)</td>
    <td>8.23</td>
    <td><b>74.3</b></td>
    <td><b>87.5</b></td>
    <td><b>91.4</b></td>
    <td><b>96.0</b></td>
    <td><b>76.2</b></td>
    <td><b>87.3</b></td>
    <td><b>94.9</b></td>
    <td>98.1</td>
    <td><b>89.6</b></td>
    <td><b>94.5</b></td>
  </tr>
</table>

##### Sequence matching results

Oxford1 and Oxford2 denote the two Oxford RobotCar query/database pairs listed under Testing Data.

<table>
  <tr>
    <th></th>
    <th colspan="3">MSLS Val</th>
    <th colspan="3">NordLand</th>
    <th colspan="3">Oxford1</th>
    <th colspan="3">Oxford2</th>
  </tr>
  <tr>
    <th>Method</th>
    <th>R@1</th>
    <th>R@5</th>
    <th>R@10</th>
    <th>R@1</th>
    <th>R@5</th>
    <th>R@10</th>
    <th>R@1</th>
    <th>R@5</th>
    <th>R@10</th>
    <th>R@1</th>
    <th>R@5</th>
    <th>R@10</th>
  </tr>
  <tr>
    <td>SeqMatchNet</td>
    <td>65.5</td>
    <td>77.5</td>
    <td>80.3</td>
    <td>56.1</td>
    <td>71.4</td>
    <td>76.9</td>
    <td>36.8</td>
    <td>43.3</td>
    <td>48.3</td>
    <td>27.9</td>
    <td>38.5</td>
    <td>45.3</td>
  </tr>
  <tr>
    <td>SeqVLAD</td>
    <td>89.9</td>
    <td>92.4</td>
    <td>94.1</td>
    <td>65.5</td>
    <td>75.2</td>
    <td>80.0</td>
    <td>58.4</td>
    <td>72.8</td>
    <td>80.8</td>
    <td>19.1</td>
    <td>29.9</td>
    <td>37.3</td>
  </tr>
  <tr>
    <td>CaseVPR</td>
    <td><u>91.2</u></td>
    <td><u>94.1</u></td>
    <td><u>95.0</u></td>
    <td><u>84.1</u></td>
    <td><u>89.9</u></td>
    <td><u>92.2</u></td>
    <td><u>90.5</u></td>
    <td><u>95.2</u></td>
    <td><u>96.5</u></td>
    <td><u>72.8</u></td>
    <td><u>85.8</u></td>
    <td><u>89.9</u></td>
  </tr>
  <tr>
    <td>UniPR-3D (ours)</td>
    <td><b>93.7</b></td>
    <td><b>95.7</b></td>
    <td><b>96.9</b></td>
    <td><b>86.8</b></td>
    <td><b>91.7</b></td>
    <td><b>93.8</b></td>
    <td><b>95.4</b></td>
    <td><b>98.1</b></td>
    <td><b>98.7</b></td>
    <td><b>80.6</b></td>
    <td><b>90.3</b></td>
    <td><b>93.9</b></td>
  </tr>
</table>

## Compute Infrastructure

### Hardware
- NVIDIA RTX 4090

### Software
- Python 3.11.10 + CUDA 12.1
- Based on [SALAD](https://github.com/serizba/salad) and [VGGT](https://github.com/facebookresearch/vggt)

## Citation

**BibTeX:**
```bibtex
@article{deng2025unipr3d,
  title={UniPR-3D: Towards Universal Visual Place Recognition with 3D Visual Geometry Grounded Transformer},
  author={Deng, Tianchen and Chen, Xun and Li, Ziming and Shen, Hongming and Wang, Danwei and Civera, Javier and Wang, Hesheng},
  journal={arXiv preprint arXiv:2512.21078},
  year={2025}
}
```

**APA:**
Deng, T., Chen, X., Li, Z., Shen, H., Wang, D., Civera, J., & Wang, H. (2025). UniPR-3D: Towards Universal Visual Place Recognition with 3D Visual Geometry Grounded Transformer. *arXiv preprint arXiv:2512.21078*.

## Contact

For questions, pretrained model access, or qualitative comparisons, please contact:
📧 **Tianchen Deng** – [[email protected]](mailto:[email protected])

---

> 📌 **Acknowledgement**: This implementation builds upon [SALAD](https://github.com/serizba/salad) and [VGGT](https://github.com/facebookresearch/vggt). Please cite those works if you use their components.