---
base_model:
- Vdr1/L3-8B-Sunfall-v0.4-Lunar-Stheno
- deepseek-ai/DeepSeek-R1-Distill-Llama-8B
library_name: transformers
tags:
- mergekit
- merge
---
# merge
This is a merge of pre-trained language models created using [mergekit](https://github.com/cg123/mergekit).
## Merge Details
### Merge Method
This model was merged using the [SLERP](https://en.wikipedia.org/wiki/Slerp) merge method.
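SLERP interpolates along the great-circle arc between two weight tensors rather than along the straight line, which preserves the norm of the blended weights better than plain linear interpolation. A minimal sketch of the idea (not mergekit's actual implementation), assuming the weights are flattened NumPy arrays:

```python
import numpy as np

def slerp(t, v0, v1, eps=1e-8):
    """Spherical linear interpolation: t=0 returns v0, t=1 returns v1."""
    v0_n = v0 / (np.linalg.norm(v0) + eps)
    v1_n = v1 / (np.linalg.norm(v1) + eps)
    dot = np.clip(np.dot(v0_n, v1_n), -1.0, 1.0)
    # Nearly parallel vectors: fall back to linear interpolation,
    # since the spherical formula becomes numerically unstable.
    if abs(dot) > 0.9995:
        return (1 - t) * v0 + t * v1
    theta = np.arccos(dot)          # angle between the two weight vectors
    sin_theta = np.sin(theta)
    return (np.sin((1 - t) * theta) / sin_theta) * v0 \
         + (np.sin(t * theta) / sin_theta) * v1
```

The `t` parameter is the same interpolation factor that the configuration below schedules per layer: `t=0` keeps the base model's weights, `t=1` takes the other model's.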
### Models Merged
The following models were included in the merge:
* [Vdr1/L3-8B-Sunfall-v0.4-Lunar-Stheno](https://huggingface.co/Vdr1/L3-8B-Sunfall-v0.4-Lunar-Stheno)
* [deepseek-ai/DeepSeek-R1-Distill-Llama-8B](https://huggingface.co/deepseek-ai/DeepSeek-R1-Distill-Llama-8B)
### Configuration
The following YAML configuration was used to produce this model:
```yaml
# --- Merge Configuration: Linear Ramp with Mild Head/Input Blend ---
models:
  - model: Vdr1/L3-8B-Sunfall-v0.4-Lunar-Stheno
  - model: deepseek-ai/DeepSeek-R1-Distill-Llama-8B
merge_method: slerp
base_model: deepseek-ai/DeepSeek-R1-Distill-Llama-8B
dtype: bfloat16
# t-schedule per layer: small Sunfall on input/head, ramping up linearly,
# plateau at 0.8 through the mid-network, then ramp down slightly at the end.
parameters:
  t: [0.2, 0.4, 0.6, 0.8, 0.8, 0.6, 0.4, 0.2, 0.2]
# — Interpretation: these nine values are anchor points that mergekit
#   interpolates across the model's 32 transformer layers; the first and
#   last anchors govern the layers nearest the embedding and LM head.
#   0.2 Sunfall → ramp up → plateau at 0.8 → ramp down → 0.2 Sunfall
# Notes:
# 1. Using a slightly nonzero t=0.2 at both the embedding layer and LM head
# helps align scales and prevents representational drift when Sunfall’s
# mid-network activations become stronger.
# 2. The linear ramp (0.2 → 0.8) avoids any sudden “hard switch” that can cause
# instability or hallucination bubbles in a single block.
# 3. The plateau of t=0.8 across four central blocks emphasizes Sunfall’s style
# without giving it 100% in any one layer—this typically yields more coherent
# blending than a full t=1.0 peak.
# 4. You can experiment with extending or contracting the plateau (e.g. 3 vs. 5
# central blocks) based on your qualitative outputs.
# DeepSeek-dominant input & output remains in effect by using t=0.2
# (i.e. mostly base model) at both ends.
```
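The short `t` list above is stretched over the full layer stack. A rough sketch of that per-layer schedule, assuming linear interpolation of the anchor values over Llama-3-8B's 32 transformer blocks (an illustration of the gradient behavior, not mergekit's exact code):

```python
import numpy as np

# Anchor values from the YAML config above.
t_anchors = [0.2, 0.4, 0.6, 0.8, 0.8, 0.6, 0.4, 0.2, 0.2]
n_layers = 32  # Llama-3-8B transformer block count

# Stretch the anchors over all layers with linear interpolation.
positions = np.linspace(0, len(t_anchors) - 1, n_layers)
per_layer_t = np.interp(positions, np.arange(len(t_anchors)), t_anchors)

print(per_layer_t.round(2))  # one t value per transformer block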