clearing up

README.md CHANGED
- conversational
- distillation
- math
language:
- en
library_name: transformers
---
This is the bf16 safetensors variant

![image/png](https://cdn-uploads.huggingface.co/production/uploads/67ca4bb28ba5dbbc347a14aa/79Ni9WplYYFvU_qaQ7kKH.png)
# What it is

DistilGPT-OSS-qwen3-4B is a fine-tune of Qwen3 4B-2507 Thinking. It supports up to **262K** tokens of total context (input plus output) and can think for up to **65536** tokens when set to **high** reasoning effort. Unlike the original Qwen3, which was most likely fine-tuned on DeepSeek R1 outputs for its advanced reasoning, this model was fine-tuned on GPT-OSS reasoning outputs. By fine-tuning on GPT-OSS outputs, the model learned to think efficiently, to follow instructions better, and to adjust its thinking effort based on how much you want it to think.

⚠️ This model is NOT as censored as the original GPT-OSS; we focused on performance rather than censorship. The model is still safety trained, it just allows for more *"creative"* prompts than GPT-OSS does. We are not responsible for what the model generates.
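A minimal usage sketch with 🤗 Transformers for this bf16 variant. Reasoning effort is selected via the system prompt; the `Reasoning: high` line below is a GPT-OSS-style placeholder, and the repo id and generation settings are assumptions rather than the card's official example:

```python
# Sketch only: repo id, system-prompt wording and generation settings are assumptions.
from transformers import AutoModelForCausalLM, AutoTokenizer

model_id = "DistilGPT-OSS-qwen3-4B"  # placeholder: use this repo's actual id

tokenizer = AutoTokenizer.from_pretrained(model_id)
model = AutoModelForCausalLM.from_pretrained(
    model_id, torch_dtype="bfloat16", device_map="auto"
)

messages = [
    # Reasoning effort is set via the system prompt; "high" allows the longest thinking.
    {"role": "system", "content": "Reasoning: high"},
    {"role": "user", "content": "Explain what a hash map is in two sentences."},
]

inputs = tokenizer.apply_chat_template(
    messages, add_generation_prompt=True, return_tensors="pt"
).to(model.device)

outputs = model.generate(inputs, max_new_tokens=2048)
print(tokenizer.decode(outputs[0][inputs.shape[-1]:], skip_special_tokens=True))
```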
DistilGPT-OSS-qwen3-4B should be used for the following:

- Local, on-device efficient assistance.
- Code generation.
- Math generation.
- Summary generation.
- General day-to-day use.

Or anything else.

❌⚠️ It should ABSOLUTELY **not** be used for:

- Anything law-related, due to hallucinations.
- Medical questions.
- Anything high risk which requires 1:1 accuracy.

It is a small model, thus its general knowledge is limited by its size.
As you can see, based on the reasoning effort and your prompt, the model will think for a different amount of time.
Keep in mind, these tests were done in LM Studio with the GGUF q8_0 on a single consumer card (RTX 3080), where we got 80-95 tokens/second at 8192 context.
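The numbers above come from LM Studio, whose local server exposes an OpenAI-compatible API, so the same GGUF can be queried at each reasoning effort. A sketch, assuming the default `http://localhost:1234/v1` endpoint, a placeholder model name, and a GPT-OSS-style system line:

```python
# Sketch: endpoint, model name and system-prompt wording are assumptions.
from openai import OpenAI

client = OpenAI(base_url="http://localhost:1234/v1", api_key="lm-studio")

for effort in ("low", "medium", "high"):
    response = client.chat.completions.create(
        model="distilgpt-oss-qwen3-4b",  # placeholder: the name LM Studio shows for the GGUF
        messages=[
            {"role": "system", "content": f"Reasoning: {effort}"},
            {"role": "user", "content": "Summarize the rules of chess in one paragraph."},
        ],
        max_tokens=4096,
    )
    print(effort, "->", response.choices[0].message.content[:200])
```

Higher effort should show noticeably more thinking before the answer arrives, matching the behaviour described above.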
# How it was done

We first started with some public datasets, removed almost all of the "I am sorry, but..." refusals from the dataset, filtered it and skipped the first 25k samples, then mixed in outputs from the big 120B GPT-OSS when we saw that the model was not as good at certain things.
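As a rough illustration of that filtering step (not the actual pipeline: the dataset name, refusal phrases, message format and cut-off handling are placeholders):

```python
# Illustrative sketch: skip the first 25k samples and drop conversations whose
# assistant turns contain refusal boilerplate. Names and fields are placeholders.
from datasets import load_dataset

REFUSAL_MARKERS = ("i am sorry but", "i'm sorry, but", "i can't help with")

def has_refusal(example):
    return any(
        marker in turn["content"].lower()
        for turn in example["messages"]
        if turn["role"] == "assistant"
        for marker in REFUSAL_MARKERS
    )

raw = load_dataset("some-public-reasoning-dataset", split="train")  # placeholder name
kept = raw.select(range(25_000, len(raw)))        # skip the first 25k samples
kept = kept.filter(lambda ex: not has_refusal(ex))
print(f"{len(raw)} -> {len(kept)} samples after filtering")
```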
After doing that, we formatted it into the proper Qwen3 chat format and did a few test runs using different optimizers, configurations, etc. Keep in mind, we trained on about 15K samples, with each sample having 3 turns (the entire dataset was multi-turn); the AdEMAMix optimizer was chosen.
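One way to get the Qwen3 chat format is the base tokenizer's chat template; a minimal sketch (the 3-turn sample and the base repo id are assumptions, and whether the original pipeline did it exactly this way is not stated):

```python
# Sketch: render a made-up multi-turn sample with the Qwen3 chat template.
from transformers import AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("Qwen/Qwen3-4B-Thinking-2507")  # assumed base model

sample = [
    {"role": "user", "content": "What is 12 * 13?"},
    {"role": "assistant", "content": "12 * 13 = 156."},
    {"role": "user", "content": "And divided by 4?"},
]

text = tokenizer.apply_chat_template(sample, tokenize=False, add_generation_prompt=True)
print(text)  # Qwen3-formatted string, ready for SFT tokenization
```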
We did a few test runs to see if it even learns anything, and what it learns. We had runs where it was very censored, runs where it looped, and this one was the **best**. In addition, we added some outputs we generated with the 120B GPT-OSS to improve performance. The simplest way to explain the performance is like this:

- Imagine the biggest GPT-OSS (120B) is like GPT-5.
- The official smallest GPT-OSS (20B) is like GPT-5 mini.
- And this one is like GPT-5 Nano.

Obviously, no, these models do not compare to closed-source OpenAI models; this comparison is just to explain it simply.
This is how these models should be used: the biggest GPT-OSS for the hard, complicated tasks, the smaller 20B for average tasks, and our "open weights GPT-5 Nano equivalent" for easier day-to-day tasks. (As a reminder, it does NOT have the same performance as GPT-5 Nano. Not even close to it.)
# Additional information

The model was trained using unsloth, on a mix of private and public datasets.
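For reference, a minimal sketch of what an unsloth + TRL SFT run with the AdEMAMix optimizer could look like. Everything here is an assumption for illustration (base repo id, LoRA setup, dataset file, hyperparameters), not the actual recipe, and `optim="ademamix"` requires a transformers/bitsandbytes version that ships that optimizer:

```python
# Illustrative sketch only: not the real training configuration.
from datasets import load_dataset
from trl import SFTConfig, SFTTrainer
from unsloth import FastLanguageModel

model, tokenizer = FastLanguageModel.from_pretrained(
    model_name="Qwen/Qwen3-4B-Thinking-2507",  # assumed base model repo id
    max_seq_length=16384,
    load_in_4bit=True,
)
model = FastLanguageModel.get_peft_model(
    model,
    r=32,
    lora_alpha=32,
    target_modules=["q_proj", "k_proj", "v_proj", "o_proj",
                    "gate_proj", "up_proj", "down_proj"],
)

# Placeholder: the ~15K multi-turn samples, already rendered to Qwen3-format text.
dataset = load_dataset("json", data_files="train.jsonl", split="train")

trainer = SFTTrainer(
    model=model,
    tokenizer=tokenizer,
    train_dataset=dataset,
    args=SFTConfig(
        dataset_text_field="text",
        per_device_train_batch_size=2,
        gradient_accumulation_steps=8,
        num_train_epochs=2,
        learning_rate=2e-5,
        optim="ademamix",  # the AdEMAMix optimizer mentioned above
        output_dir="outputs",
    ),
)
trainer.train()
```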