---
license: apache-2.0
tags:
  - rag
  - clara
  - qwen
base_model: Qwen/Qwen3-4B-Instruct-2507
library_name: transformers
---

# CLaRa: Continuous Latent Reasoning Agent

This is a trained CLaRa model based on Qwen/Qwen3-4B-Instruct-2507.

## Model Description

CLaRa (Bridging Retrieval and Generation with Continuous Latent Reasoning) is a system that unifies retrieval and generation in a shared continuous space.

## Usage

Because this model relies on custom modeling code, you must load it with `trust_remote_code=True` (see the note on pinning a revision at the end of this card).

```python
from transformers import AutoModel, AutoTokenizer
import torch

model_id = "bagpipejerry/clara-qwen3-4b"

# Load the model.
# Note: device_map="auto" is recommended for large models.
model = AutoModel.from_pretrained(model_id, trust_remote_code=True, device_map="auto")

# Load the tokenizer.
tokenizer = AutoTokenizer.from_pretrained(model_id, trust_remote_code=True)

# Example 1: Single-document QA
question = "What is CLaRa?"
documents = [["CLaRa is a framework that bridges retrieval and generation with continuous latent reasoning."]]

outputs = model.generate_from_text(
    questions=[question],
    documents=documents,
    tokenizer=tokenizer,
    max_new_tokens=100,
)
print(outputs[0])

# Example 2: Multi-document retrieval & reranking (Stage 3 capability)
print("\n" + "=" * 60)
print("📝 Example 2: Multi-Document Retrieval & Reranking")
print("=" * 60)

test_questions_multi = ["How does CLaRa compress documents?"]
test_documents_multi = [[
    "CLaRa uses a compression pretraining approach called KPCP framework with QA pairs and paraphrases.",
    "The compressor is trained to compress documents into continuous latent representations while retaining key semantics.",
    "CLaRa achieves 32x-64x compression rates through its three-stage training pipeline.",
    "The framework uses DeepSpeed ZeRO-2 for distributed training across multiple GPUs.",
]]

outputs_multi, topk_indices = model.generate_from_questions(
    questions=test_questions_multi,
    documents=test_documents_multi,
    tokenizer=tokenizer,
    max_new_tokens=100,
    temperature=0.7,
)

print(f"Question: {test_questions_multi[0]}")
print(f"Candidate Documents: {len(test_documents_multi[0])} docs")
print(f"Top-K Selected Indices: {topk_indices[0].tolist() if torch.is_tensor(topk_indices[0]) else topk_indices[0]}")
print(f"CLaRa Response: {outputs_multi[0]}")
```

## Citation

If you use this model, please cite the CLaRa paper/project.
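
## Pinning a Revision

Because `trust_remote_code=True` downloads and executes this repository's custom modeling code, you may want to pin the exact revision you have vetted rather than tracking `main`. A minimal sketch using the standard `revision` argument of `from_pretrained` (the value below is a placeholder, not a specific vetted commit):

```python
from transformers import AutoModel, AutoTokenizer

model_id = "bagpipejerry/clara-qwen3-4b"
revision = "main"  # placeholder: replace with a vetted commit hash from the model repo

# Pinning `revision` ensures the same remote code is loaded on every run.
model = AutoModel.from_pretrained(
    model_id,
    trust_remote_code=True,
    device_map="auto",
    revision=revision,
)
tokenizer = AutoTokenizer.from_pretrained(model_id, trust_remote_code=True, revision=revision)
```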