🧠 Mental Health Triage: Phi-3-mini QLoRA Fine-Tune

A domain-specific fine-tuned version of microsoft/Phi-3-mini-4k-instruct trained to perform structured mental health triage classification from plain-text messages.

Given a person's self-described emotional or psychological state, the model produces a validated JSON triage response covering severity level, concern type, recommended clinical action, risk flags, an empathetic opening, and a follow-up question.

⚠️ Disclaimer: This model is for research and educational purposes only. It is not a substitute for professional mental health assessment or clinical diagnosis. Always refer people in crisis to qualified professionals and emergency services.


🎯 Model Details

Model Description

  • Developed by: kasi-ranaweera
  • Model type: Causal Language Model, fine-tuned for structured JSON generation
  • Base model: microsoft/Phi-3-mini-4k-instruct (3.8B parameters)
  • Fine-tuning method: QLoRA (4-bit NF4 quantization) via PEFT + TRL SFTTrainer
  • Language: English
  • License: MIT
  • Training infrastructure: Google Colab (NVIDIA T4 GPU, 15GB VRAM)
  • Finetuned from: microsoft/Phi-3-mini-4k-instruct


🚀 How to Get Started with the Model

from transformers import AutoTokenizer, AutoModelForCausalLM, BitsAndBytesConfig
import torch, json

MODEL_ID = "kasi-ranaweera/mental-health-triage-phi3-qlora"

# Load with 4-bit quantization (recommended: fits on a free-tier GPU)
bnb_config = BitsAndBytesConfig(
    load_in_4bit=True,
    bnb_4bit_quant_type="nf4",
    bnb_4bit_compute_dtype=torch.bfloat16,
    bnb_4bit_use_double_quant=True
)

tokenizer = AutoTokenizer.from_pretrained(MODEL_ID, trust_remote_code=True)
model     = AutoModelForCausalLM.from_pretrained(
    MODEL_ID,
    quantization_config=bnb_config,
    device_map="auto",
    trust_remote_code=True,
    torch_dtype=torch.bfloat16
)

SYSTEM_PROMPT = """You are a mental health triage assistant. Analyze the person's message carefully and respond ONLY with a valid JSON object using this exact schema:
{
  "severity": "crisis | high | moderate | low",
  "concern_type": "suicidal_ideation | self_harm | depression | anxiety | panic | ptsd | grief | burnout | loneliness | eating_disorder | substance_abuse | general_distress",
  "recommended_action": "emergency_services | immediate_therapist | scheduled_therapist | self_help_resources | peer_support | monitoring",
  "risk_flags": ["only flags explicitly present in the message"],
  "empathetic_opening": "One warm validating sentence specific to this person",
  "follow_up_question": "One clarifying question"
}
Rules: crisis → emergency_services or immediate_therapist only. No markdown. JSON only."""

user_message = "I've been feeling really low for the past few weeks. I can't sleep, I've lost interest in things I used to enjoy, and I've been calling in sick to work. I just don't see the point anymore."

prompt = (
    f"<|system|>\n{SYSTEM_PROMPT}\n<|end|>\n"
    f"<|user|>\n{user_message}\n<|end|>\n"
    f"<|assistant|>\n"
)

inputs  = tokenizer(prompt, return_tensors="pt").to(model.device)
outputs = model.generate(
    **inputs,
    max_new_tokens=300,
    temperature=0.1,
    do_sample=True,
    pad_token_id=tokenizer.eos_token_id
)
generated = outputs[0][inputs["input_ids"].shape[1]:]
response  = tokenizer.decode(generated, skip_special_tokens=True).strip()

try:
    print(json.dumps(json.loads(response), indent=2))
except json.JSONDecodeError:
    # Parse failures are possible on long or unusual inputs (see Limitations)
    print(response)

Expected output:

{
  "severity": "high",
  "concern_type": "depression",
  "recommended_action": "immediate_therapist",
  "risk_flags": ["anhedonia", "sleep_disruption", "social_withdrawal", "hopelessness"],
  "empathetic_opening": "It sounds like you've been carrying something heavy for a while now, and losing interest in things you used to love is a real sign that you need support.",
  "follow_up_question": "When you say you don't see the point anymore β€” can you tell me more about what that feels like for you?"
}

📋 Uses

Direct Use

This model is intended for:

  • Research into structured mental health NLP and triage automation
  • Educational demonstrations of domain-specific LLM fine-tuning
  • Prototype development for mental health support tools (with human oversight)
  • Benchmarking QLoRA fine-tuning on small-scale clinical datasets

Downstream Use

With additional development, this model could serve as a component in:

  • Mental health chatbot triage layers (with mandatory human review)
  • Support ticket severity routing systems
  • Clinical decision support prototypes (requiring clinical validation)

Out-of-Scope Use

  • Clinical diagnosis or treatment decisions: this model must not replace licensed clinicians
  • Crisis intervention without human oversight: always escalate crisis cases to emergency services
  • Deployment without safety guardrails: outputs must be reviewed by qualified professionals
  • Languages other than English: not trained on multilingual data
  • Pediatric populations: training data focused on adult presentations

⚠️ Bias, Risks, and Limitations

Clinical limitations:

  • Trained on 142 synthetic examples generated by an LLM teacher model; not reviewed by clinical psychologists
  • Severity boundary decisions (especially high vs moderate) may be inconsistent in edge cases
  • The concern_type field is limited to a closed vocabulary of 12 categories; novel presentations may be missed
  • Does not account for cultural differences in how mental health distress is expressed

Technical limitations:

  • Hallucination rate of 10.0% on held-out test set (10 manually reviewed responses)
  • Severity boundary confusion between high and moderate for anxiety-spectrum presentations
  • JSON parse failures are possible on very long or unusual inputs; handle with try/except (see the sketch after this list)
  • Performance on non-English text is untested and likely poor
  • The Phi-3-mini tokenizer splits JSON into multi-token sequences; ROUGE-L is therefore noisier than BERTScore for this task
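
Because malformed JSON is a known failure mode, a defensive parsing helper is advisable in any downstream integration. The sketch below is illustrative rather than an excerpt from the training notebook: it extracts the first JSON object from the raw generation and signals failure instead of raising.

import json, re

def parse_triage_response(raw: str) -> dict | None:
    """Best-effort extraction of the triage JSON from raw model output."""
    # Isolate the first {...} span in case stray text surrounds the object
    match = re.search(r"\{.*\}", raw, flags=re.DOTALL)
    if match is None:
        return None
    try:
        return json.loads(match.group(0))
    except json.JSONDecodeError:
        return None  # Caller should route to human triage or the RAG fallback

triage = parse_triage_response(response)
if triage is None:
    print("Unparseable output; escalating to human review:", response)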

Bias risks:

  • Training data was synthetically generated and may underrepresent minority demographics
  • Cultural, linguistic, and socioeconomic diversity in distress expression is limited
  • Model may reflect biases present in the teacher models (Groq Llama-3.3-70B / GPT-OSS-120B)

Recommendations

  • Always wrap model outputs in clinical human review before any action is taken
  • Implement a fallback to human triage for all crisis severity predictions
  • Do not use confidence scores alone to bypass human oversight
  • Regularly audit outputs across diverse demographic groups for systematic errors

📊 Training Details

Training Data

Dataset: Synthetically generated using Groq llama-3.3-70b-versatile (primary) and openai/gpt-oss-120b (fallback) as teacher models.

  • Total examples: 142
  • Format: JSONL with Phi-3 chat template (system / user / assistant turns)
  • Split: 80% train / 10% validation / 10% test

Coverage:

| Severity | Count |
|----------|-------|
| crisis   | 36    |
| high     | 27    |
| moderate | 39    |
| low      | 40    |
| Total    | 142   |

Concern types covered (12 categories): suicidal_ideation · self_harm · depression · anxiety · panic · ptsd · grief · burnout · loneliness · eating_disorder · substance_abuse · general_distress

Data generation system prompt: Available in full in the training notebook (Cell 7, TEACHER_SYSTEM_PROMPT constant).
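
For orientation, the teacher call plausibly looks like the sketch below using the Groq SDK. The user instruction shown is a hypothetical stand-in; the real TEACHER_SYSTEM_PROMPT and generation loop live in the notebook.

from groq import Groq

client = Groq()  # reads GROQ_API_KEY from the environment
TEACHER_SYSTEM_PROMPT = "..."  # full text in the training notebook (Cell 7)

completion = client.chat.completions.create(
    model="llama-3.3-70b-versatile",  # primary teacher; gpt-oss-120b used as fallback
    messages=[
        {"role": "system", "content": TEACHER_SYSTEM_PROMPT},
        {"role": "user", "content": "Generate one moderate-severity anxiety example."},
    ],
)
example = completion.choices[0].message.content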

Dataset validation:

  • Pydantic schema enforced clinical logic constraints (e.g., crisis → emergency_services or immediate_therapist only); a condensed sketch follows this list
  • Near-duplicate check: 0 near-duplicate pairs found (>85% string similarity threshold)
  • Prompt length distribution: mean ~65 words, std ~18 words, range 30–120 words
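
A condensed sketch of what this validation can look like (field names follow the output schema above; the notebook's exact Pydantic model and similarity metric may differ):

from difflib import SequenceMatcher
from typing import Literal
from pydantic import BaseModel, model_validator

class TriageResponse(BaseModel):
    severity: Literal["crisis", "high", "moderate", "low"]
    concern_type: Literal[
        "suicidal_ideation", "self_harm", "depression", "anxiety", "panic",
        "ptsd", "grief", "burnout", "loneliness", "eating_disorder",
        "substance_abuse", "general_distress",
    ]
    recommended_action: Literal[
        "emergency_services", "immediate_therapist", "scheduled_therapist",
        "self_help_resources", "peer_support", "monitoring",
    ]
    risk_flags: list[str]
    empathetic_opening: str
    follow_up_question: str

    @model_validator(mode="after")
    def crisis_must_escalate(self):
        # Clinical logic constraint: crisis severity may only pair with escalation
        if self.severity == "crisis" and self.recommended_action not in (
            "emergency_services", "immediate_therapist",
        ):
            raise ValueError("crisis severity requires an escalation action")
        return self

def is_near_duplicate(a: str, b: str, threshold: float = 0.85) -> bool:
    # Assumed string-similarity check for the >85% near-duplicate screen
    return SequenceMatcher(None, a, b).ratio() > threshold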

Training Procedure

Preprocessing

  • Tokenized with Phi-3-mini tokenizer using right-padding
  • Chat template applied: <|system|>...<|end|><|user|>...<|end|><|assistant|>...<|end|> (a formatting sketch follows this list)
  • Packing disabled to prevent cross-example context leakage
  • Max sequence length: 1024 tokens (examples average ~530 tokens)
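
For reference, a formatting function in the spirit of the above; it assumes JSONL records with system/user/assistant fields, and its output would populate the dataset's text column (the notebook's version may differ):

def format_example(record: dict) -> str:
    """Render one JSONL record into the Phi-3 chat template used for SFT."""
    return (
        f"<|system|>\n{record['system']}<|end|>\n"
        f"<|user|>\n{record['user']}<|end|>\n"
        f"<|assistant|>\n{record['assistant']}<|end|>"
    )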

Training Hyperparameters

| Parameter | Value | Justification |
|---|---|---|
| Training regime | BF16 mixed precision | Wider dynamic range than FP16 when paired with the NF4 compute dtype |
| Quantization | 4-bit NF4, double quantization | Best quantization-error/memory trade-off for normally distributed weights |
| LoRA rank (r) | 16 | Sufficient subspace for clinical vocabulary remapping; r=8 too restrictive for JSON |
| LoRA alpha | 32 | 2× alpha/r ratio strengthens adapter influence over the base model's prior distribution |
| LoRA dropout | 0.05 | Light regularization to prevent overfitting on a 142-example dataset |
| Target modules | q_proj, k_proj, v_proj, o_proj, gate_proj, up_proj, down_proj | All 7 attention + SwiGLU MLP projections for full semantic remapping |
| Learning rate | 2e-4 | Empirically validated for QLoRA (Dettmers et al., 2023) |
| LR scheduler | cosine | Gentle LR tail prevents late-epoch format drift |
| Warmup ratio | 0.03 | 3% of steps as warmup prevents gradient spikes at random adapter init |
| Epochs | 3 | Validation loss decreases across all 3 epochs (confirmed in W&B run) |
| Batch size | 2 | T4 VRAM limit (15GB); batch size 4 causes OOM |
| Gradient accumulation | 8 | Effective batch size = 16 without additional VRAM |
| Max seq length | 1024 | Examples are ≤600 tokens; 1024 provides safe headroom |
| Gradient checkpointing | True | Recomputes activations in the backward pass to reduce peak VRAM |
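
A minimal configuration sketch reproducing the table above. PEFT/TRL argument names vary across versions, so treat this as illustrative rather than a verbatim excerpt from the notebook (train_dataset and val_dataset are assumed to expose a formatted text column):

from peft import LoraConfig
from trl import SFTConfig, SFTTrainer

lora_config = LoraConfig(
    r=16, lora_alpha=32, lora_dropout=0.05, bias="none",
    task_type="CAUSAL_LM",
    target_modules=["q_proj", "k_proj", "v_proj", "o_proj",
                    "gate_proj", "up_proj", "down_proj"],
)

training_args = SFTConfig(
    output_dir="phi3-triage-qlora",
    num_train_epochs=3,
    per_device_train_batch_size=2,
    gradient_accumulation_steps=8,   # effective batch size = 16
    learning_rate=2e-4,
    lr_scheduler_type="cosine",
    warmup_ratio=0.03,
    bf16=True,
    gradient_checkpointing=True,
    max_seq_length=1024,             # renamed in some recent TRL releases
    packing=False,                   # no cross-example context leakage
)

trainer = SFTTrainer(
    model=model,                 # the 4-bit quantized base model (assumed variable)
    args=training_args,
    train_dataset=train_dataset,
    eval_dataset=val_dataset,
    peft_config=lora_config,
)
trainer.train()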

Trainable Parameters

  • Trainable: 8,912,896 (0.233% of total)
  • Total: ~3.8 billion (Phi-3-mini base)

Validation Loss (per epoch)

| Epoch | Validation Loss |
|-------|-----------------|
| 1     | 1.0598          |
| 2     | 0.5949          |
| 3     | 0.4982 ✅       |

Loss decreases monotonically across all 3 epochs; no overfitting observed.

Speeds, Sizes, Times

  • Training time: ~35–45 minutes on Google Colab T4 GPU
  • VRAM usage: ~11GB peak during training (15GB available)
  • Merged model size: ~7.5GB (bfloat16 safetensors; see the merge sketch below)
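
The merged bfloat16 checkpoint can be produced with the standard PEFT merge pattern; the sketch below uses placeholder paths:

import torch
from peft import PeftModel
from transformers import AutoModelForCausalLM

base = AutoModelForCausalLM.from_pretrained(
    "microsoft/Phi-3-mini-4k-instruct",
    torch_dtype=torch.bfloat16,
    trust_remote_code=True,
)
# Fold the LoRA adapters into the base weights, then save as safetensors
merged = PeftModel.from_pretrained(base, "path/to/adapter").merge_and_unload()
merged.save_pretrained("mental-health-triage-phi3-merged", safe_serialization=True)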

📈 Evaluation

Testing Data

Held-out test set: 15 examples (10% of the full 142-example dataset), same distribution as training set. Baseline: identical microsoft/Phi-3-mini-4k-instruct base model with the same system prompt but no fine-tuning.

Metrics

| Metric | Description |
|---|---|
| ROUGE-L | Longest common subsequence overlap between predicted and reference JSON |
| BERTScore F1 | Semantic similarity via DistilBERT contextual embeddings |
| LLM-as-judge | Groq Llama-3-70B scores outputs on a clinical rubric (structured JSON), n=10 |
| Hallucination rate | % of manually reviewed responses with unsupported flags or severity-action mismatches |
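
Both automatic metrics can be computed with the libraries listed in the tech stack. A sketch, assuming predictions and references are parallel lists of JSON strings (the DistilBERT model_type is inferred from the description above):

from rouge_score import rouge_scorer
from bert_score import score as bertscore

scorer = rouge_scorer.RougeScorer(["rougeL"], use_stemmer=True)
rouge_l = [
    scorer.score(ref, pred)["rougeL"].fmeasure
    for ref, pred in zip(references, predictions)
]

# BERTScore F1 over the same prediction/reference pairs
_, _, f1 = bertscore(predictions, references,
                     model_type="distilbert-base-uncased", lang="en")
print(f"ROUGE-L: {sum(rouge_l) / len(rouge_l):.4f}  BERTScore F1: {f1.mean().item():.4f}")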

Results

Automatic Metrics

| Metric | Base Model | Fine-Tuned | Delta |
|---|---|---|---|
| ROUGE-L | 0.3494 | 0.3899 | +0.0405 |
| BERTScore F1 | 0.8742 | 0.8931 | +0.0189 |

LLM-as-Judge Results (Groq Llama-3-70B, n=10)

| Criterion | Base Model | Fine-Tuned |
|---|---|---|
| Overall score | 0.000 | 0.000 |
| % Correct verdicts | 0% | 0% |

ℹ️ Note: LLM-as-judge scores of 0.000 for both models indicate that the judge model's structured JSON scoring rubric did not align with either model's output format on this evaluation run. Automatic ROUGE-L and BERTScore metrics, as well as manual hallucination review, are the primary evaluation signals for this task.

Hallucination Rate (Manual Review, n=10)

| Label | Count | % |
|---|---|---|
| Correct | 7 | 70% |
| Partially correct | 2 | 20% |
| Hallucinated | 1 | 10% |

Overall hallucination rate: 10.0% (1 of 10 reviewed responses).

Summary

Fine-tuning produced measurable improvements across all automatic metrics (+0.0405 ROUGE-L, +0.0189 BERTScore F1). The most significant gains were in severity–action consistency and risk flag grounding. The base model frequently recommended low-escalation actions for high-risk presentations; the fine-tuned model learned to apply the clinical severity-action rules enforced during training. Remaining failure modes include severity boundary confusion at the high/moderate boundary for anxiety-spectrum presentations without explicit functional impairment language.


πŸ” Model Examination

Qualitative Analysis

Where fine-tuning improved performance: The clearest improvement is in clinically consistent severity–action pairing. The base model assigned peer_support as the recommended action for passive suicidal ideation ("I sometimes wonder what's the point of going on"), while the fine-tuned model correctly escalated to immediate_therapist with severity=high. Risk flag grounding also improved significantly: the fine-tuned model learned to include only flags with textual evidence in the input, reducing hallucinated flags such as substance_use or appetite_changes with no basis in the message.

Remaining failure modes: The primary failure mode is severity boundary confusion between high and moderate for anxiety-spectrum presentations, particularly panic disorder with agoraphobia. Without explicit functional impairment language in the input, the model tends to classify these as moderate. This reflects a training data imbalance: high-severity anxiety examples without suicidal ideation were underrepresented. A second data generation round with 30+ targeted examples, plus a DPO stage penalizing generic empathetic openings, would address the two main gaps.


🔄 RAG Fallback Layer

When the fine-tuned model's perplexity-normalized confidence falls below 0.65, the system retrieves relevant context from a ChromaDB vector store (10 clinical reference documents, embedded with all-MiniLM-L6-v2) and re-queries the model with the augmented prompt.

| Component | Detail |
|---|---|
| Vector store | ChromaDB (local), 10 clinical reference documents |
| Embedding model | all-MiniLM-L6-v2 (sentence-transformers) |
| Confidence method | Perplexity-normalized score |
| RAG threshold | 0.65 |

A concrete before/after example is included in Cell 30 of the training notebook.
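
A sketch of the confidence gate and retrieval step, reusing model, tokenizer, SYSTEM_PROMPT, user_message, and prompt from the quickstart. The exact scoring and prompt-augmentation code lives in the notebook; Chroma's default embedding function is all-MiniLM-L6-v2, matching the table above.

import math
import chromadb

client = chromadb.PersistentClient(path="clinical_refs_db")
collection = client.get_or_create_collection("clinical_refs")  # 10 reference docs

def generate_with_confidence(prompt_text: str):
    """Perplexity-normalized confidence: exp(mean token log-prob), in (0, 1]."""
    inputs = tokenizer(prompt_text, return_tensors="pt").to(model.device)
    out = model.generate(**inputs, max_new_tokens=300,
                         output_scores=True, return_dict_in_generate=True,
                         pad_token_id=tokenizer.eos_token_id)
    logprobs = model.compute_transition_scores(
        out.sequences, out.scores, normalize_logits=True)
    confidence = math.exp(logprobs[0].mean().item())  # equals 1 / perplexity
    text = tokenizer.decode(out.sequences[0][inputs["input_ids"].shape[1]:],
                            skip_special_tokens=True)
    return text, confidence

text, confidence = generate_with_confidence(prompt)
if confidence < 0.65:  # RAG threshold from the table above
    docs = collection.query(query_texts=[user_message], n_results=3)["documents"][0]
    augmented = (
        f"<|system|>\n{SYSTEM_PROMPT}\n\nRelevant clinical context:\n"
        + "\n".join(docs)
        + f"\n<|end|>\n<|user|>\n{user_message}\n<|end|>\n<|assistant|>\n"
    )
    text, confidence = generate_with_confidence(augmented)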


🌱 Environmental Impact

Carbon emissions estimated using the Machine Learning Impact calculator.

  • Hardware Type: NVIDIA T4 GPU (Google Colab free tier)
  • Hours used: ~0.75 hours (training) + ~0.5 hours (evaluation)
  • Cloud Provider: Google (Colab)
  • Compute Region: US (estimated)

βš™οΈ Technical Specifications

Model Architecture and Objective

  • Architecture: Phi-3-mini-4k-instruct (decoder-only transformer, 3.8B parameters)
  • Attention: Multi-head attention with RoPE positional embeddings
  • MLP: SwiGLU activation (gate_proj × up_proj → down_proj)
  • Context window: 4096 tokens (1024 used during training)
  • Objective: Causal language modeling (next-token prediction) on instruction-following chat format
  • Fine-tuning objective: Supervised fine-tuning (SFT) on mental health triage examples

LoRA Adapter Architecture

Base model: microsoft/Phi-3-mini-4k-instruct (frozen, 4-bit NF4)
LoRA adapters applied to: q_proj, k_proj, v_proj, o_proj, gate_proj, up_proj, down_proj
Adapter rank (r):    16
Adapter alpha:       32
Scaling (alpha/r):   2.0
Dropout:             0.05
Trainable params:    8,912,896 (0.233% of total)

Compute Infrastructure

  • Hardware: NVIDIA T4 GPU (15GB VRAM), Google Colab free tier
  • Software: Python 3.12, PyTorch 2.x, Transformers, PEFT, TRL SFTTrainer, BitsAndBytes (4-bit NF4)
  • Experiment tracking: Weights & Biases

Tech Stack

| Component | Library |
|---|---|
| Fine-tuning | TRL SFTTrainer · PEFT LoraConfig |
| Quantization | BitsAndBytes (4-bit NF4) |
| Base model | Hugging Face transformers |
| Evaluation | rouge-score · bert-score |
| RAG | ChromaDB · sentence-transformers |
| Experiment tracking | Weights & Biases |
| Teacher inference | Groq SDK |

πŸ“ Citation

If you use this model in your work, please cite:

BibTeX:

@misc{kasiranaweera2026mentalhealthtriage,
  title        = {Mental Health Triage: Domain-Specific QLoRA Fine-Tuning of Phi-3-mini},
  author       = {kasi-ranaweera},
  year         = {2026},
  publisher    = {Hugging Face},
  howpublished = {\url{https://huggingface.co/kasi-ranaweera/mental-health-triage-phi3-qlora}},
  note         = {Fine-tuned on synthetic mental health triage dataset using QLoRA (4-bit NF4)}
}

APA:

kasi-ranaweera. (2026). Mental Health Triage: Domain-Specific QLoRA Fine-Tuning of Phi-3-mini [Model]. Hugging Face. https://huggingface.co/kasi-ranaweera/mental-health-triage-phi3-qlora


📚 Glossary

| Term | Definition |
|---|---|
| QLoRA | Quantized Low-Rank Adaptation: fine-tuning with a 4-bit quantized base model plus low-rank adapters |
| NF4 | Normal Float 4-bit: quantization format optimized for normally distributed neural network weights |
| LoRA rank (r) | Number of dimensions in the adapter subspace; higher = more parameters, more capacity |
| SFT | Supervised Fine-Tuning: training on labeled input-output pairs |
| Perplexity | Measure of how "surprised" a model is by its own output; lower = more confident |
| Severity (crisis) | Active suicidal ideation, active self-harm, or acute psychosis requiring immediate intervention |
| Severity (high) | Significant functional impairment, passive suicidal ideation, or severe symptom presentation |
| Severity (moderate) | Noticeable distress with some functional impact; professional support recommended |
| Severity (low) | Mild or situational distress without significant functional impairment |
| C-SSRS | Columbia Suicide Severity Rating Scale: clinical framework for suicidality assessment |
| PHQ-9 | Patient Health Questionnaire-9: validated depression severity screening tool |


👤 Model Card Authors

kasi-ranaweera

📬 Model Card Contact

Please open an issue on the Hugging Face model page for questions or feedback.
