vidavox/Qwen3-SKK-32B-DPO

DPO-aligned version of a Qwen3-32B model that was first fine-tuned with SFT (vidavox/Qwen3-SKK-32B-SFT-LoRA) on SKK’s KSMI document data.

This repository contains the **Direct Preference Optimization (DPO)**–tuned model weights.
The training follows the standard DPO setup from TRL, using preference pairs built on top of the SFT model’s generations.


Model Details

  • Base foundation model: Qwen/Qwen3-32B (dense 32B causal LM)
  • SFT stage: vidavox/Qwen3-SKK-32B-SFT-LoRA (instruction tuning on KSMI Q&A).
  • Alignment stage: Direct Preference Optimization (DPO) on preference pairs derived from the SFT model’s outputs.
  • Domain: SKK upstream oil & gas, using KSMI and related SKK regulatory documents as the primary knowledge source.
  • Languages: Primarily Bahasa Indonesia and English, with technical / regulatory style.

The goal of this model is to better match SKK internal preferences (answer style, helpfulness, hallucination trade-offs, etc.) while staying in-distribution with the KSMI domain.


Usage

This checkpoint is a standard causal LM and can be loaded directly with AutoModelForCausalLM.
(If you are working with the PEFT adapter from the SFT stage instead, adapt the LoRA loading pattern sketched below; the typical usage for this repository is full weights.)
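For reference, a minimal sketch of loading the SFT LoRA adapter on top of the base model with PEFT (requires pip install peft); it assumes vidavox/Qwen3-SKK-32B-SFT-LoRA is a standard PEFT adapter repository:

# Minimal sketch: loading the SFT LoRA adapter on top of the Qwen3-32B base.
# Assumes vidavox/Qwen3-SKK-32B-SFT-LoRA is a standard PEFT adapter repo.
import torch
from peft import PeftModel
from transformers import AutoModelForCausalLM, AutoTokenizer

BASE_ID = "Qwen/Qwen3-32B"
ADAPTER_ID = "vidavox/Qwen3-SKK-32B-SFT-LoRA"

tokenizer = AutoTokenizer.from_pretrained(BASE_ID)
base_model = AutoModelForCausalLM.from_pretrained(
    BASE_ID,
    device_map="auto",
    torch_dtype=torch.bfloat16,
)
model = PeftModel.from_pretrained(base_model, ADAPTER_ID)
model.eval()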

1. Install dependencies

pip install "transformers>=4.43.0" accelerate bitsandbytes

2. Load tokenizer and model

import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

MODEL_ID = "vidavox/Qwen3-SKK-32B-DPO"

tokenizer = AutoTokenizer.from_pretrained(
    MODEL_ID,
    trust_remote_code=True,
)

model = AutoModelForCausalLM.from_pretrained(
    MODEL_ID,
    device_map="auto",
    torch_dtype=torch.bfloat16,   # or "auto"
    trust_remote_code=True,
)
model.eval()

Qwen3 is supported natively in transformers >= 4.51.0, so trust_remote_code=True is not strictly required on recent versions; it is kept in the snippets above for compatibility with older setups. The tokenizer ships the official Qwen3 chat template used in the next step.

3. Chat-style inference (Qwen3 chat template)

messages = [
    {
        "role": "system",
        "content": "You are an assistant specialized in SKK KSMI documents.",
    },
    {
        "role": "user",
        # "Briefly explain the stages of the POD approval process according to KSMI."
        "content": "Jelaskan secara ringkas tahapan proses persetujuan POD berdasarkan KSMI.",
    },
]

input_text = tokenizer.apply_chat_template(
    messages,
    tokenize=False,
    add_generation_prompt=True,
    enable_thinking=False,  # disable Qwen3 thinking mode
)

model_inputs = tokenizer([input_text], return_tensors="pt").to(model.device)
generated_ids = model.generate(
    **model_inputs,
    max_new_tokens=512,
    do_sample=False,  # greedy decoding; set do_sample=True with temperature/top_p to sample
)
output_ids = generated_ids[0][len(model_inputs.input_ids[0]):].tolist()
response = tokenizer.decode(output_ids, skip_special_tokens=True)
print(response)
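If you instead generate with enable_thinking=True, the Qwen3 chat template lets the model emit its reasoning before the final answer. A minimal sketch of splitting the two, following the pattern shown in the Qwen3 model card (it assumes 151668 is the token id of </think> in the Qwen3 tokenizer):

# Minimal sketch: separating the reasoning block from the final answer when
# generating with enable_thinking=True. Assumes token id 151668 is </think>
# in the Qwen3 tokenizer, as shown in the Qwen3 model card.
try:
    # Index just past the last </think> token in the generated ids.
    index = len(output_ids) - output_ids[::-1].index(151668)
except ValueError:
    index = 0  # no thinking block was produced

thinking_content = tokenizer.decode(output_ids[:index], skip_special_tokens=True).strip("\n")
answer = tokenizer.decode(output_ids[index:], skip_special_tokens=True).strip("\n")
print(answer)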

4. Optional: 4-bit loading for constrained VRAM

import torch
from transformers import AutoModelForCausalLM, AutoTokenizer, BitsAndBytesConfig

MODEL_ID = "vidavox/Qwen3-SKK-32B-DPO"

bnb_config = BitsAndBytesConfig(
    load_in_4bit=True,
    bnb_4bit_compute_dtype=torch.bfloat16,
    bnb_4bit_use_double_quant=True,
    bnb_4bit_quant_type="nf4",
)

tokenizer = AutoTokenizer.from_pretrained(
    MODEL_ID,
    trust_remote_code=True,
)

model = AutoModelForCausalLM.from_pretrained(
    MODEL_ID,
    device_map="auto",
    quantization_config=bnb_config,
    trust_remote_code=True,
)
model.eval()

Generation code is the same as in section 3.


Training Data (Preference Stage)

The DPO stage uses preference pairs derived from:

  • Prompts based on SKK’s KSMI document data (same domain as the SFT model).
  • For each prompt, at least two candidate responses (e.g. SFT model outputs + weaker alternatives).
  • A preference signal indicating the chosen vs. rejected answer, produced by an SDA-style evaluation / labeling pipeline (see the example record below).
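For illustration, a hypothetical preference record in the prompt / chosen / rejected layout commonly used for DPO training with TRL; the text is invented for this example, since the actual dataset is private:

# Hypothetical preference record (prompt / chosen / rejected). The text is
# illustrative only; the real KSMI preference data is private.
preference_record = {
    "prompt": "Briefly explain the stages of the POD approval process according to KSMI.",
    "chosen": "A faithful, well-structured answer grounded in the KSMI document ...",
    "rejected": "A weaker or unfaithful alternative answer ...",
}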

Key characteristics:

  • Domain-specific, technical regulatory Q&A in Indonesian and English.
  • Focus on:
    • better adherence to the underlying KSMI content,
    • improved helpfulness and structure,
    • discouraging clearly worse / unfaithful generations.
  • Dataset is private and not released with the model.

Before DPO, the model is already instruction-tuned via SFT on KSMI data; DPO then optimizes it to respect relative preferences between answers, following the DPO formulation.
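For orientation, a minimal sketch of what this stage could look like with TRL's DPOTrainer. The dataset path, hyperparameters, and LoRA settings below are illustrative assumptions, not the actual training configuration:

# Minimal DPO sketch with TRL. Dataset path, hyperparameters, and the LoRA
# config are illustrative assumptions, not the actual training setup.
from datasets import load_dataset
from peft import LoraConfig
from transformers import AutoModelForCausalLM, AutoTokenizer
from trl import DPOConfig, DPOTrainer

SFT_MODEL_ID = "vidavox/Qwen3-SKK-32B-SFT-LoRA"  # SFT starting point

tokenizer = AutoTokenizer.from_pretrained(SFT_MODEL_ID)
model = AutoModelForCausalLM.from_pretrained(SFT_MODEL_ID)

# Expects records with "prompt", "chosen", "rejected" fields (hypothetical path).
train_dataset = load_dataset("json", data_files="ksmi_preferences.jsonl", split="train")

training_args = DPOConfig(
    output_dir="qwen3-skk-32b-dpo",
    beta=0.1,  # strength of the implicit KL penalty against the reference model
    per_device_train_batch_size=1,
    gradient_accumulation_steps=8,
    learning_rate=5e-6,
    num_train_epochs=1,
)

trainer = DPOTrainer(
    model=model,
    args=training_args,
    train_dataset=train_dataset,
    processing_class=tokenizer,  # older TRL versions use tokenizer= instead
    peft_config=LoraConfig(r=16, lora_alpha=32, task_type="CAUSAL_LM"),
)
trainer.train()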


Evaluation (SDA on 50-sample test set)

As with the SFT model, evaluation is performed on a 50-example test set using an SDA-style pipeline:

  • Automatic overlap / semantic metrics:
    • BERTScore (precision, recall, F1)
  • Human-oriented quality metrics (1–10 Likert scale):
    • correctness, completeness, factuality, structure,
    • hallucination_resistance
  • Perplexity (LM confidence / fluency).

Below are the mean scores for the DPO model:

| Metric | Mean (test) | Scale / note |
|---|---|---|
| BERTScore F1 | 0.840 | 0–1, higher = better semantic similarity |
| Correctness | 4.88 | 1–10, higher = more logically correct answers |
| Completeness | 4.14 | 1–10, higher = more required information covered |
| Factuality | 6.28 | 1–10, higher = fewer factual errors |
| Structure | 7.00 | 1–10, higher = better organization / formatting |
| Hallucination resistance | 6.34 | 1–10, higher = less hallucination |

Perplexity computation for this run produced an infinite / undefined mean (likely due to numerical issues on a small sample). It is therefore not used as a primary comparison metric.
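For orientation, a rough sketch of how the automatic metrics could be recomputed on (prediction, reference) pairs, reusing the model and tokenizer loaded in the Usage section; the scorer settings and example texts are placeholders, not the actual SDA pipeline:

# Illustrative recomputation of BERTScore and per-sample perplexity.
# Scorer settings and example texts are placeholders, not the SDA pipeline.
import math
import torch
from bert_score import score as bert_score

predictions = ["Generated answer about POD approval ..."]     # model outputs
references = ["Reference answer from the KSMI test set ..."]  # gold answers

# BERTScore with a multilingual scorer selected via lang="id".
P, R, F1 = bert_score(predictions, references, lang="id")
print(f"BERTScore F1 (mean): {F1.mean().item():.3f}")

# Per-sample perplexity of the references under the model, guarding against
# non-finite values (a single overflow otherwise makes the mean infinite).
def sample_perplexity(text: str) -> float:
    enc = tokenizer(text, return_tensors="pt").to(model.device)
    with torch.no_grad():
        out = model(**enc, labels=enc["input_ids"])
    return math.exp(out.loss.item())

ppls = [sample_perplexity(t) for t in references]
finite = [p for p in ppls if math.isfinite(p)]
print(f"Mean perplexity over finite samples: {sum(finite) / len(finite):.2f}")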

Relationship to the SFT model

On this particular 50-example test set, the DPO-aligned model’s automatic and SDA metrics are slightly lower than those of the SFT model. Two points should be kept in mind when interpreting this:

  • This DPO stage primarily targets preference alignment (i.e., relative quality of outputs under SKK’s internal criteria), not raw metric gains.
  • Performance trade-offs on small held-out sets should be interpreted cautiously; larger, task-specific evaluations are recommended when deciding between the SFT and DPO checkpoints.

(Exact SFT metrics can be found in the model card for vidavox/Qwen3-SKK-32B-SFT-LoRA.)


Intended Use (High-Level)

  • Primary use:
    • Internal SKK assistant systems where subjective preferences (helpfulness, tone, avoidance of clearly bad outputs) matter.
    • Question answering grounded in KSMI and related upstream oil & gas regulations.
  • Not intended for:
    • General-purpose open-domain chat without additional validation.
    • Safety-critical applications (medical, legal, financial, etc.).
    • Use outside the SKK / KSMI domain without further alignment and evaluation.

As with any DPO-aligned model, outputs may reflect the biases and preferences present in the underlying preference data. Careful review is recommended before deploying in downstream systems.


References

  • DPO paper: Rafailov et al., Direct Preference Optimization: Your Language Model is Secretly a Reward Model, 2023.

  • TRL library: Hugging Face TRL – post-training methods including SFT, PPO, GRPO, and DPO.

  • Qwen3 base model: Qwen/Qwen3-32B and Qwen3 technical report.
