# vidavox/Qwen3-SKK-32B-DPO
DPO-aligned version of a Qwen3-32B model that was first fine-tuned with SFT (vidavox/Qwen3-SKK-32B-SFT-LoRA) on SKK’s KSMI document data.
This repository contains the **Direct Preference Optimization (DPO)**–tuned model weights.
The training follows the standard DPO setup from TRL, using preference pairs built on top of the SFT model's generations.
## Model Details

- Base foundation model: `Qwen/Qwen3-32B` (dense 32B causal LM).
- SFT stage: `vidavox/Qwen3-SKK-32B-SFT-LoRA` (instruction tuning on KSMI Q&A).
- Alignment stage: Direct Preference Optimization (DPO) on preference pairs derived from the SFT model's outputs.
- Domain: SKK upstream oil & gas, using KSMI and related SKK regulatory documents as the primary knowledge source.
- Languages: Primarily Bahasa Indonesia and English, with technical / regulatory style.
The goal of this model is to better match SKK internal preferences (answer style, helpfulness, hallucination trade-offs, etc.) while staying in-distribution with the KSMI domain.
## Usage

This checkpoint is a standard causal LM and can be loaded directly with `AutoModelForCausalLM`.
(If the weights are stored as a PEFT adapter instead, you can adapt the SFT LoRA loading pattern; the typical usage, however, is full weights.)
### 1. Install dependencies

```bash
pip install "transformers>=4.51.0" accelerate bitsandbytes
```
### 2. Load tokenizer and model

```python
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

MODEL_ID = "vidavox/Qwen3-SKK-32B-DPO"

tokenizer = AutoTokenizer.from_pretrained(
    MODEL_ID,
    trust_remote_code=True,
)

model = AutoModelForCausalLM.from_pretrained(
    MODEL_ID,
    device_map="auto",
    torch_dtype=torch.bfloat16,  # or "auto"
    trust_remote_code=True,
)
model.eval()
```
Recent `transformers` releases (>= 4.51.0) support the Qwen3 architecture and chat template natively; `trust_remote_code=True` is passed above for compatibility.
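If the checkpoint were distributed as a PEFT/LoRA adapter instead of full weights (see the note in the Usage intro), loading would follow the adapter pattern used for the SFT model. A minimal sketch under that hypothetical assumption; this repository is assumed to ship full weights:

```python
# Hypothetical adapter-style loading; only relevant if the repo ships a PEFT adapter
# rather than merged full weights. Requires `pip install peft`.
import torch
from peft import PeftModel
from transformers import AutoModelForCausalLM, AutoTokenizer

BASE_ID = "Qwen/Qwen3-32B"
ADAPTER_ID = "vidavox/Qwen3-SKK-32B-DPO"  # assumed to contain adapter weights in this scenario

tokenizer = AutoTokenizer.from_pretrained(BASE_ID, trust_remote_code=True)
base_model = AutoModelForCausalLM.from_pretrained(
    BASE_ID,
    device_map="auto",
    torch_dtype=torch.bfloat16,
    trust_remote_code=True,
)
model = PeftModel.from_pretrained(base_model, ADAPTER_ID)  # attach the adapter to the base model
model.eval()
```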
### 3. Chat-style inference (Qwen3 chat template)

```python
messages = [
    {
        "role": "system",
        "content": "You are an assistant specialized in SKK KSMI documents.",
    },
    {
        "role": "user",
        # "Briefly explain the stages of the POD approval process under KSMI."
        "content": "Jelaskan secara ringkas tahapan proses persetujuan POD berdasarkan KSMI.",
    },
]

input_text = tokenizer.apply_chat_template(
    messages,
    tokenize=False,
    add_generation_prompt=True,
    enable_thinking=False,  # disable Qwen3 thinking mode
)

model_inputs = tokenizer([input_text], return_tensors="pt").to(model.device)

generated_ids = model.generate(
    **model_inputs,
    max_new_tokens=512,
    do_sample=False,  # greedy decoding; set do_sample=True to use temperature / top_p
)

output_ids = generated_ids[0][len(model_inputs.input_ids[0]):].tolist()
response = tokenizer.decode(output_ids, skip_special_tokens=True)
print(response)
```
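Optionally, the response can be streamed to stdout as it is generated instead of being decoded at the end, which is convenient for interactive use. A minimal sketch using the standard `transformers.TextStreamer` utility on top of the snippet above (not part of the original example):

```python
# Stream the answer token by token while generation runs.
from transformers import TextStreamer

streamer = TextStreamer(tokenizer, skip_prompt=True, skip_special_tokens=True)

model.generate(
    **model_inputs,
    max_new_tokens=512,
    do_sample=False,
    streamer=streamer,  # prints decoded tokens incrementally as they are generated
)
```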
### 4. Optional: 4-bit loading for constrained VRAM

```python
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer, BitsAndBytesConfig

MODEL_ID = "vidavox/Qwen3-SKK-32B-DPO"

bnb_config = BitsAndBytesConfig(
    load_in_4bit=True,
    bnb_4bit_compute_dtype=torch.bfloat16,
    bnb_4bit_use_double_quant=True,
    bnb_4bit_quant_type="nf4",
)

tokenizer = AutoTokenizer.from_pretrained(
    MODEL_ID,
    trust_remote_code=True,
)

model = AutoModelForCausalLM.from_pretrained(
    MODEL_ID,
    device_map="auto",
    quantization_config=bnb_config,
    trust_remote_code=True,
)
model.eval()
```
Generation code is the same as in section 3.
## Training Data (Preference Stage)
The DPO stage uses preference pairs derived from:
- Prompts based on SKK's KSMI document data (the same domain as the SFT model).
- At least two candidate responses per prompt (e.g. SFT model outputs plus weaker alternatives).
- A preference signal marking the chosen vs. rejected answer, produced by an SDA-style evaluation / labeling pipeline.
Key characteristics:
- Domain-specific, technical regulatory Q&A in Indonesian and English.
- Focus on:
  - better adherence to the underlying KSMI content,
  - improved helpfulness and structure,
  - discouraging clearly worse / unfaithful generations.
- Dataset is private and not released with the model.
Before DPO, the model is already instruction-tuned via SFT on KSMI data; DPO then optimizes it to respect relative preferences between answers, following the DPO formulation.
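For reference, the standard TRL DPO setup mentioned above looks roughly like the sketch below. This is a minimal illustration, not the actual training recipe: the dataset path, hyperparameters, and the assumption that the SFT checkpoint is available as merged full weights are all placeholders, and some argument names vary between TRL releases.

```python
# Minimal sketch of a standard TRL DPO run (illustrative hyperparameters and paths).
from datasets import load_dataset
from transformers import AutoModelForCausalLM, AutoTokenizer
from trl import DPOConfig, DPOTrainer

SFT_MODEL = "vidavox/Qwen3-SKK-32B-SFT-LoRA"  # assumes merged SFT weights; merge the LoRA first if needed

tokenizer = AutoTokenizer.from_pretrained(SFT_MODEL)
model = AutoModelForCausalLM.from_pretrained(SFT_MODEL)

# TRL expects preference records with "prompt", "chosen" and "rejected" fields.
dataset = load_dataset("json", data_files="ksmi_preferences.jsonl", split="train")  # private data; path illustrative

training_args = DPOConfig(
    output_dir="qwen3-skk-32b-dpo",
    beta=0.1,                    # strength of the implicit KL penalty against the reference model
    per_device_train_batch_size=1,
    gradient_accumulation_steps=8,
    learning_rate=5e-7,
)

trainer = DPOTrainer(
    model=model,
    ref_model=None,              # TRL keeps a frozen copy of the policy as the reference
    args=training_args,
    train_dataset=dataset,
    processing_class=tokenizer,  # named `tokenizer=` in older TRL releases
)
trainer.train()
```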
## Evaluation (SDA on 50-sample test set)
As with the SFT model, evaluation is performed on a 50-example test set using an SDA-style pipeline:
- Automatic overlap / semantic metrics:
  - BERTScore (precision, recall, F1); see the scoring sketch after this list
- Human-oriented quality metrics (1–10 Likert scale):
  - correctness, completeness, factuality, structure, hallucination_resistance
- Perplexity (LM confidence / fluency).
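The BERTScore column can be computed with the open-source `bert-score` package along the lines below. This is a minimal sketch under assumptions: the actual SDA pipeline, its backbone model, and the language setting are not specified in this card.

```python
# Minimal BERTScore sketch (illustrative inputs; not the actual SDA pipeline).
from bert_score import score  # pip install bert-score

candidates = ["model answer 1", "model answer 2"]   # generated answers
references = ["gold answer 1", "gold answer 2"]     # reference answers from the test set

# lang="id" selects a multilingual backbone suitable for Indonesian text
P, R, F1 = score(candidates, references, lang="id")
print(f"BERTScore P/R/F1: {P.mean().item():.3f} / {R.mean().item():.3f} / {F1.mean().item():.3f}")
```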
Below are the mean scores for the DPO model:
| Metric | Mean (test) | Scale / note |
|---|---|---|
| BERTScore F1 | 0.840 | 0–1, higher = better semantic similarity |
| Correctness | 4.88 | 1–10, higher = more logically correct answers |
| Completeness | 4.14 | 1–10, higher = more required information covered |
| Factuality | 6.28 | 1–10, higher = fewer factual errors |
| Structure | 7.00 | 1–10, higher = better organization / formatting |
| Hallucination resistance | 6.34 | 1–10, higher = less hallucination |
Perplexity computation for this run produced an infinite / undefined mean (likely due to numerical issues on a small sample). It is therefore not used as a primary comparison metric.
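A common way to avoid an infinite mean is to aggregate at the token level: sum the negative log-likelihood over all test tokens and exponentiate once, rather than averaging per-sample perplexities (where a single outlier can push the mean to infinity). A minimal sketch of that aggregation, not the SDA pipeline's actual implementation:

```python
# Corpus-level perplexity: exponentiate the token-averaged NLL once, instead of
# averaging per-sample perplexities.
import math
import torch

@torch.no_grad()
def corpus_perplexity(model, tokenizer, texts, max_length=2048):
    total_nll, total_tokens = 0.0, 0
    for text in texts:
        enc = tokenizer(text, return_tensors="pt", truncation=True, max_length=max_length).to(model.device)
        n_tokens = enc["input_ids"].shape[1] - 1  # number of next-token prediction targets
        if n_tokens <= 0:
            continue
        out = model(**enc, labels=enc["input_ids"])  # out.loss is the mean NLL per target token
        total_nll += out.loss.item() * n_tokens
        total_tokens += n_tokens
    return math.exp(total_nll / total_tokens)
```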
## Relationship to the SFT model
On this particular 50-example test set, the DPO-aligned model's automatic and SDA metrics are slightly lower than those of the SFT model. Two points should be kept in mind:
- This DPO stage primarily targets preference alignment (i.e., the relative quality of outputs under SKK's internal criteria), and
- performance differences on small held-out sets should be interpreted cautiously; larger, task-specific evaluations are recommended when choosing between the SFT and DPO checkpoints.
(Exact SFT metrics can be found in the model card for vidavox/Qwen3-SKK-32B-SFT-LoRA.)
## Intended Use (High-Level)
Primary use:
- Internal SKK assistant systems where subjective preferences (helpfulness, tone, avoidance of clearly bad outputs) matter.
- Question answering grounded in KSMI and related upstream O&G regulations.
Not intended for:
- General-purpose open-domain chat without additional validation.
- Safety-critical applications (medical, legal, financial, etc.).
- Use outside the SKK / KSMI domain without further alignment and evaluation.
As with any DPO-aligned model, outputs may reflect the biases and preferences present in the underlying preference data. Careful review is recommended before deploying in downstream systems.
## References

- DPO paper: Rafailov et al., *Direct Preference Optimization: Your Language Model is Secretly a Reward Model*, 2023.
- TRL library: Hugging Face TRL – post-training methods including SFT, PPO, GRPO, and DPO.
- Qwen3 base model: Qwen/Qwen3-32B and the Qwen3 technical report.