---
license: apache-2.0
pipeline_tag: zero-shot-image-classification
tags:
- datology
- clip
- vision
- OpenCLIP
- datacomp
- zero-shot-classification
---

# DatologyAI CLIP Classification Optimized ViT-B/32

**DatologyAI CLIP** is a state-of-the-art contrastive vision-language model that achieves superior performance through advanced data curation alone, without any architectural or training modifications. This classification-optimized ViT-B/32 model outperforms SigLIP2, MetaCLIP, and DFN on zero-shot classification benchmarks.

## Model Description

DatologyAI's CLIP model demonstrates that careful data curation can drive state-of-the-art performance without modifications to model architecture or training paradigms. Key achievements include:

- **76.91% ImageNet1k accuracy** (vs 74.0% for SigLIP2)
- **8x training efficiency** compared to standard approaches
- Trained on 13B curated image-text pairs from DataComp
- Standard CLIP architecture and training procedure

## Intended Uses

You can use this model for zero-shot image classification or as a vision encoder for vision-language models (VLMs) and other downstream vision tasks.

### Zero-shot Image Classification

```python
import torch
from PIL import Image
import open_clip

# Load model and preprocessing
model, _, preprocess = open_clip.create_model_and_transforms('hf-hub:DatologyAI/cls-opt-vit-b-32')
tokenizer = open_clip.get_tokenizer('hf-hub:DatologyAI/cls-opt-vit-b-32')
model.eval()

# Load image
image = preprocess(Image.open("path/to/image.jpg")).unsqueeze(0)

# Define candidate labels
labels = ["a dog", "a cat", "a bird"]
text = tokenizer(labels)

# Run inference
with torch.no_grad():
    image_features = model.encode_image(image)
    text_features = model.encode_text(text)
    
    # Normalize features
    image_features /= image_features.norm(dim=-1, keepdim=True)
    text_features /= text_features.norm(dim=-1, keepdim=True)
    
    # Calculate similarity
    similarity = (100.0 * image_features @ text_features.T).softmax(dim=-1)
    
# Get predictions
values, indices = similarity[0].topk(3)
for value, index in zip(values, indices):
    print(f"{labels[index]}: {value.item():.2%}")
```
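
When classifying with bare class names, a prompt template such as `"a photo of a {label}"` usually improves zero-shot accuracy; prompt templating and ensembling are standard practice from the original CLIP evaluation protocol.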

### Image Encoding

```python
import torch
from PIL import Image
import open_clip

# Load model
model, _, preprocess = open_clip.create_model_and_transforms('hf-hub:DatologyAI/cls-opt-vit-b-32')
model.eval()

# Process image
image = preprocess(Image.open("path/to/image.jpg")).unsqueeze(0)

# Extract features
with torch.no_grad():
    image_features = model.encode_image(image)
    
print(f"Feature shape: {image_features.shape}")  # [1, 512]
```
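
### Text Encoding

Text features can be extracted symmetrically and cached for retrieval or repeated classification. A minimal sketch using the same hub ID and OpenCLIP calls as above:

```python
import torch
import open_clip

# Load model and tokenizer
model, _, _ = open_clip.create_model_and_transforms('hf-hub:DatologyAI/cls-opt-vit-b-32')
tokenizer = open_clip.get_tokenizer('hf-hub:DatologyAI/cls-opt-vit-b-32')
model.eval()

# Tokenize a batch of captions (padded/truncated to the 77-token context)
text = tokenizer(["a photo of a dog", "a photo of a cat"])

# Extract features
with torch.no_grad():
    text_features = model.encode_text(text)

print(f"Feature shape: {text_features.shape}")  # [2, 512]
```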

## Training Procedure

DatologyAI's training pipeline focuses on sophisticated data curation techniques, including:

1. **Improved target distribution matching** - Task-specific alignment of image features for classification
2. **Enhanced synthetic data generation** - Optimized caption generation for classification tasks
3. **Predictive metrics for curation quality** - Rapid iteration without full model training

The model uses standard CLIP training objectives with no architectural modifications.
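
As an illustration of that standard objective, here is a minimal sketch of the symmetric contrastive (InfoNCE) loss that CLIP optimizes over a batch of paired embeddings. This is the textbook formulation, not DatologyAI's training code:

```python
import torch
import torch.nn.functional as F

def clip_loss(image_features: torch.Tensor,
              text_features: torch.Tensor,
              logit_scale: torch.Tensor) -> torch.Tensor:
    # L2-normalize both towers' embeddings
    image_features = F.normalize(image_features, dim=-1)
    text_features = F.normalize(text_features, dim=-1)

    # Pairwise cosine similarities, scaled by the learned temperature
    logits_per_image = logit_scale * image_features @ text_features.T
    logits_per_text = logits_per_image.T

    # The i-th image matches the i-th caption, so targets are the diagonal
    labels = torch.arange(image_features.shape[0], device=image_features.device)

    # Symmetric cross-entropy over image->text and text->image directions
    return (F.cross_entropy(logits_per_image, labels)
            + F.cross_entropy(logits_per_text, labels)) / 2
```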

## Training Data

The model was trained on 13B image-text pairs (seen over multiple epochs) curated from the **DataComp-XL** dataset using DatologyAI's proprietary curation pipeline. The curation process selected a high-quality, classification-relevant subset from the 10B pairs available in DataComp-XL.

## Evaluation Results

### Zero-shot Classification Performance

| Benchmark | DatologyAI | SigLIP2 | MetaCLIP |
|-----------|------------|---------|----------|
| ImageNet1k | **76.91%** | 74.0% | 67.7% |
| ImageNetv2 | **70.2%** | 67.1% | 60.4% |

### Training Efficiency
- Matches SigLIP2 performance with only **5B samples** (87.5% compute reduction)
- Matches MetaCLIP performance with only **1B samples** (92% compute reduction)

For full details, see the [blog post](https://datologyai.com/blog/clip-data-upgrade).

## Model Details

- **Developed by:** DatologyAI
- **Model type:** CLIP (Contrastive Language-Image Pre-training)
- **Architecture:** Vision Transformer B/32
- **License:** Apache 2.0
- **Training framework:** OpenCLIP 2.24.0

## Technical Specifications

### Model Architecture
- **Vision Encoder:** ViT-B/32 (86M parameters)
  - Patch size: 32×32
  - Image size: 224×224
  - Embedding dimension: 512
- **Text Encoder:** 12-layer Transformer
  - Context length: 77 tokens
  - Vocabulary size: 49,408 (BPE tokenizer)
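
These dimensions can be verified directly, assuming OpenCLIP's standard module layout (`model.visual` for the image tower, `output_dim` for the joint embedding width):

```python
import open_clip

model, _, _ = open_clip.create_model_and_transforms('hf-hub:DatologyAI/cls-opt-vit-b-32')

# Parameter counts per tower and the shared embedding dimension
vision_params = sum(p.numel() for p in model.visual.parameters())
total_params = sum(p.numel() for p in model.parameters())
print(f"Vision encoder parameters: {vision_params / 1e6:.0f}M")
print(f"Total parameters: {total_params / 1e6:.0f}M")
print(f"Embedding dimension: {model.visual.output_dim}")
```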

### Training Configuration
- **Optimizer:** AdamW (β1=0.9, β2=0.98, ε=1e-6)
- **Learning rate:** 5.0e-04 with cosine schedule
- **Weight decay:** 0.1
- **Batch size:** 32,768
- **Training samples:** 13B image-text pairs
- **Hardware:** Distributed training on H100 GPUs
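
In PyTorch, that configuration maps onto roughly the following (a sketch: the warmup length is a hypothetical placeholder not stated in this card, and details such as excluding biases from weight decay are omitted):

```python
import torch
import open_clip

model, _, _ = open_clip.create_model_and_transforms('hf-hub:DatologyAI/cls-opt-vit-b-32')

# Steps implied by the card: 13B samples at 32,768 per batch
total_steps = 13_000_000_000 // 32_768
warmup_steps = 2_000  # assumption: warmup is not stated in the card

optimizer = torch.optim.AdamW(
    model.parameters(), lr=5.0e-4, betas=(0.9, 0.98), eps=1e-6, weight_decay=0.1
)

# Linear warmup followed by cosine decay to the end of training
scheduler = torch.optim.lr_scheduler.SequentialLR(
    optimizer,
    schedulers=[
        torch.optim.lr_scheduler.LinearLR(optimizer, start_factor=1e-3,
                                          total_iters=warmup_steps),
        torch.optim.lr_scheduler.CosineAnnealingLR(optimizer,
                                                   T_max=total_steps - warmup_steps),
    ],
    milestones=[warmup_steps],
)
```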

## Citation

If you use this model, please cite:

```bibtex
@article{datologyai2025clip,
  title={CLIP Gets a Data Upgrade: Outperforming SoTA with Improved Data Curation Only},
  author={DatologyAI Team},
  journal={DatologyAI Blog},
  year={2025},
  url={https://datologyai.com/blog/clip-data-upgrade}
}
```

## Additional Information

For more details on our data curation methodology and comprehensive benchmark results, please visit our [blog post](https://datologyai.com/blog/clip-data-upgrade).

**Contact:** [[email protected]](mailto:[email protected])

## Model Card Contact

DatologyAI Team - [[email protected]](mailto:[email protected])