---
license: apache-2.0
pipeline_tag: zero-shot-image-classification
tags:
- datology
- clip
- vision
- OpenCLIP
- datacomp
- zero-shot-classification
---

# DatologyAI CLIP Classification-Optimized ViT-B/32

**DatologyAI CLIP** is a state-of-the-art contrastive vision-language model that achieves superior performance through advanced data curation alone, without any architectural or training modifications. This classification-optimized ViT-B/32 model outperforms SigLIP2, MetaCLIP, and DFN on zero-shot classification benchmarks.

## Model Description

DatologyAI's CLIP model demonstrates that careful data curation can drive state-of-the-art performance without modifications to model architecture or training paradigms. Key achievements include:

- **76.91% ImageNet1k accuracy** (vs. 74.0% for SigLIP2)
- **8x training efficiency** compared to standard approaches
- Trained on 13B curated image-text pairs from DataComp
- Standard CLIP architecture and training procedure

## Intended Uses

You can use this model for zero-shot image classification or as a vision encoder for VLMs and other vision tasks.

### Zero-shot Image Classification

```python
import torch
from PIL import Image
import open_clip

# Load model and preprocessing
model, _, preprocess = open_clip.create_model_and_transforms('hf-hub:DatologyAI/cls-opt-vit-b-32')
tokenizer = open_clip.get_tokenizer('hf-hub:DatologyAI/cls-opt-vit-b-32')
model.eval()

# Load image
image = preprocess(Image.open("path/to/image.jpg")).unsqueeze(0)

# Define candidate labels
labels = ["a dog", "a cat", "a bird"]
text = tokenizer(labels)

# Run inference
with torch.no_grad():
    image_features = model.encode_image(image)
    text_features = model.encode_text(text)

    # Normalize features
    image_features /= image_features.norm(dim=-1, keepdim=True)
    text_features /= text_features.norm(dim=-1, keepdim=True)

    # Calculate similarity
    similarity = (100.0 * image_features @ text_features.T).softmax(dim=-1)

# Get predictions
values, indices = similarity[0].topk(3)
for value, index in zip(values, indices):
    print(f"{labels[index]}: {value.item():.2%}")
```

For benchmark-style evaluation, predictions are usually made with prompt templates rather than bare label strings; a sketch of this appears after the Training Data section below.

### Image Encoding

```python
import torch
from PIL import Image
import open_clip

# Load model
model, _, preprocess = open_clip.create_model_and_transforms('hf-hub:DatologyAI/cls-opt-vit-b-32')
model.eval()

# Process image
image = preprocess(Image.open("path/to/image.jpg")).unsqueeze(0)

# Extract features
with torch.no_grad():
    image_features = model.encode_image(image)

print(f"Feature shape: {image_features.shape}")  # [1, 512]
```

## Training Procedure

DatologyAI's training pipeline focuses on sophisticated data curation techniques, including:

1. **Improved target distribution matching** - Task-specific alignment of image features for classification
2. **Enhanced synthetic data generation** - Optimized caption generation for classification tasks
3. **Predictive metrics for curation quality** - Rapid iteration without full model training

The model uses the standard CLIP training objective with no architectural modifications.

## Training Data

The model was trained on 13B image-text pairs (multi-epoch) curated from the **DataComp-XL** dataset using DatologyAI's proprietary curation pipeline. The curation process selected high-quality, classification-relevant subsets from the 10B available pairs in DataComp-XL.
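The quickstart above compares an image against bare label strings. Benchmark numbers like those reported below are typically obtained with prompt templates whose text embeddings are averaged per class (prompt ensembling). The following is a minimal, hypothetical sketch of that pattern; the class names and templates are illustrative examples, not the exact protocol used to produce the reported results.

```python
import torch
from PIL import Image
import open_clip

# Illustrative sketch only: label set and templates below are hypothetical examples.
model, _, preprocess = open_clip.create_model_and_transforms('hf-hub:DatologyAI/cls-opt-vit-b-32')
tokenizer = open_clip.get_tokenizer('hf-hub:DatologyAI/cls-opt-vit-b-32')
model.eval()

class_names = ["dog", "cat", "bird"]  # hypothetical label set
templates = ["a photo of a {}.", "a blurry photo of a {}.", "a drawing of a {}."]

with torch.no_grad():
    # Build one classifier weight per class by averaging templated text embeddings
    classifier = []
    for name in class_names:
        prompts = tokenizer([t.format(name) for t in templates])
        emb = model.encode_text(prompts)
        emb = emb / emb.norm(dim=-1, keepdim=True)
        emb = emb.mean(dim=0)                    # average over templates
        classifier.append(emb / emb.norm())      # re-normalize the mean embedding
    classifier = torch.stack(classifier, dim=1)  # [embed_dim, num_classes]

    # Score an image against the ensembled classifier
    image = preprocess(Image.open("path/to/image.jpg")).unsqueeze(0)
    image_features = model.encode_image(image)
    image_features = image_features / image_features.norm(dim=-1, keepdim=True)
    probs = (100.0 * image_features @ classifier).softmax(dim=-1)

print({name: f"{p:.2%}" for name, p in zip(class_names, probs[0].tolist())})
```

Averaging several templated prompts per class usually gives a small but consistent accuracy boost over single bare labels for CLIP-style models.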
## Evaluation Results

### Zero-shot Classification Performance

| Benchmark | DatologyAI | SigLIP2 | MetaCLIP |
|-----------|------------|---------|----------|
| ImageNet1k | **76.91%** | 74.0% | 67.7% |
| ImageNetv2 | **70.2%** | 67.1% | 60.4% |

### Training Efficiency

- Matches SigLIP2 performance with only **5B samples** (87.5% compute reduction)
- Matches MetaCLIP performance with only **1B samples** (92% compute reduction)

For full details, see the [blog post](https://datologyai.com/blog/clip-data-upgrade).

## Model Details

- **Developed by:** DatologyAI
- **Model type:** CLIP (Contrastive Language-Image Pre-training)
- **Architecture:** Vision Transformer B/32
- **License:** Apache 2.0
- **Training framework:** OpenCLIP 2.24.0

## Technical Specifications

### Model Architecture

- **Vision Encoder:** ViT-B/32 (86M parameters)
  - Patch size: 32×32
  - Image size: 224×224
  - Embedding dimension: 512
- **Text Encoder:** 12-layer Transformer
  - Context length: 77 tokens
  - Vocabulary size: 49,408 (BPE tokenizer)

### Training Configuration

- **Optimizer:** AdamW (β1=0.9, β2=0.98, ε=1e-6)
- **Learning rate:** 5.0e-04 with cosine schedule
- **Weight decay:** 0.1
- **Batch size:** 32,768
- **Training samples:** 13B image-text pairs
- **Hardware:** Distributed training on H100 GPUs

## Citation

If you use this model, please cite:

```bibtex
@article{datologyai2025clip,
  title={CLIP Gets a Data Upgrade: Outperforming SoTA with Improved Data Curation Only},
  author={DatologyAI Team},
  journal={DatologyAI Blog},
  year={2025},
  url={https://datologyai.com/blog/clip-data-upgrade}
}
```

## Additional Information

For more details on our data curation methodology and comprehensive benchmark results, please visit our [blog post](https://datologyai.com/blog/clip-data-upgrade).

**Contact:** [team@datologyai.com](mailto:team@datologyai.com)

## Model Card Contact

DatologyAI Team - [team@datologyai.com](mailto:team@datologyai.com)
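## Appendix: Training Configuration as Code (illustrative)

The sketch below mirrors the hyperparameters listed under "Training Configuration" using OpenCLIP and PyTorch primitives. It is a hypothetical, single-process illustration, not the actual training script: data loading, learning-rate warmup, mixed precision, and distributed setup are all omitted, and the dummy batch exists only to show the shapes involved.

```python
import torch
import open_clip
from open_clip.loss import ClipLoss

# Hypothetical sketch mirroring the hyperparameters in "Training Configuration".
model, _, preprocess = open_clip.create_model_and_transforms('ViT-B-32')  # random init
tokenizer = open_clip.get_tokenizer('ViT-B-32')
loss_fn = ClipLoss()

optimizer = torch.optim.AdamW(
    model.parameters(),
    lr=5.0e-4,           # learning rate from the card
    betas=(0.9, 0.98),   # β1, β2 from the card
    eps=1e-6,            # ε from the card
    weight_decay=0.1,    # weight decay from the card
)

# Cosine schedule over the total number of optimizer steps
# (13B samples / 32,768 per batch ≈ 397k steps; shown for illustration only).
total_steps = 13_000_000_000 // 32_768
scheduler = torch.optim.lr_scheduler.CosineAnnealingLR(optimizer, T_max=total_steps)

# One illustrative training step on a tiny dummy batch.
images = torch.randn(4, 3, 224, 224)
texts = tokenizer(["a photo of a dog"] * 4)
image_features, text_features, logit_scale = model(images, texts)
loss = loss_fn(image_features, text_features, logit_scale)
loss.backward()
optimizer.step()
scheduler.step()
optimizer.zero_grad()
```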