testing smol-IQ4-KSS

#7 · opened by shewin

W790E Sage + QYFS + 512G + RTX5090

smol-IQ4-KSS:

Tensor blk.79.ffn_up_exps.weight buffer type overriden to CPU
llm_load_tensors: offloading 80 repeating layers to GPU
llm_load_tensors: offloading non-repeating layers to GPU
llm_load_tensors: offloaded 81/81 layers to GPU
llm_load_tensors: CPU buffer size = 42986.73 MiB
llm_load_tensors: CPU buffer size = 43907.32 MiB
llm_load_tensors: CPU buffer size = 44553.77 MiB
llm_load_tensors: CPU buffer size = 44013.15 MiB
llm_load_tensors: CPU buffer size = 43907.32 MiB
llm_load_tensors: CPU buffer size = 44638.58 MiB
llm_load_tensors: CPU buffer size = 43907.32 MiB
llm_load_tensors: CPU buffer size = 44553.77 MiB
llm_load_tensors: CPU buffer size = 44013.15 MiB
llm_load_tensors: CPU buffer size = 43907.32 MiB
llm_load_tensors: CPU buffer size = 39672.03 MiB
llm_load_tensors: CPU buffer size = 690.75 MiB
llm_load_tensors: CUDA0 buffer size = 14702.37 MiB
....................................................................................................

MLA is only available for LLM_ARCH_DEEPSEEK2 -> turning off MLA

llama_new_context_with_model: n_ctx = 120064
llama_new_context_with_model: n_batch = 2048
llama_new_context_with_model: n_ubatch = 512
llama_new_context_with_model: flash_attn = 1
llama_new_context_with_model: mla_attn = 0
llama_new_context_with_model: attn_max_b = 512
llama_new_context_with_model: fused_moe = 1
llama_new_context_with_model: grouped er = 0
llama_new_context_with_model: fused_up_gate = 1
llama_new_context_with_model: ser = -1, 0
llama_new_context_with_model: freq_base = 600000.0
llama_new_context_with_model: freq_scale = 1
llama_kv_cache_init: CUDA0 KV buffer size = 15242.53 MiB
llama_new_context_with_model: KV self size = 15242.50 MiB, K (q6_0): 7621.25 MiB, V (q6_0): 7621.25 MiB
llama_new_context_with_model: CUDA_Host output buffer size = 0.60 MiB
llama_new_context_with_model: CUDA0 compute buffer size = 323.00 MiB
llama_new_context_with_model: CUDA_Host compute buffer size = 245.26 MiB
llama_new_context_with_model: graph nodes = 3445
llama_new_context_with_model: graph splits = 154
XXXXXXXXXXXXXXXXXXXXX Setting offload policy for op MUL_MAT to OFF
XXXXXXXXXXXXXXXXXXXXX Setting offload policy for op OUT_PROD to OFF
XXXXXXXXXXXXXXXXXXXXX Setting only active experts offload

main: n_kv_max = 120064, n_batch = 2048, n_ubatch = 512, flash_attn = 1, n_gpu_layers = 99, n_threads = 101, n_threads_batch = 101

| PP  | TG  | N_KV | T_PP s | S_PP t/s | T_TG s | S_TG t/s |
|-----|-----|------|--------|----------|--------|----------|
| 512 | 128 | 0    | 9.736  | 52.59    | 14.807 | 8.64     |
| 512 | 128 | 512  | 10.209 | 50.15    | 15.452 | 8.28     |
| 512 | 128 | 1024 | 11.201 | 45.71    | 12.787 | 10.01    |
| 512 | 128 | 1536 | 10.775 | 47.52    | 13.065 | 9.80     |
| 512 | 128 | 2048 | 10.509 | 48.72    | 13.105 | 9.77     |
| 512 | 128 | 2560 | 9.364  | 54.68    | 15.037 | 8.51     |
| 512 | 128 | 3072 | 10.550 | 48.53    | 11.814 | 10.83    |
| 512 | 128 | 3584 | 9.727  | 52.64    | 11.843 | 10.81    |
| 512 | 128 | 4096 | 9.015  | 56.80    | 13.292 | 9.63     |
| 512 | 128 | 4608 | 11.188 | 45.76    | 14.481 | 8.84     |

Hi, I see your TG is not consistent. Since I have the same CPU, I can tell you've oversubscribed the threads; 101 is too many for the QYFS. Try a bit lower until the TG is stable (see the sweep sketch below).
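
Something like the sketch below is how I'd narrow it down. It assumes ik_llama.cpp's llama-sweep-bench accepts the same common flags (-m, -c, -t, -ngl, -fa, -fmoe, -ot) used for llama-server elsewhere in this thread, so adjust paths and flags to your build:

    # Hypothetical sweep over thread counts: rerun the same short benchmark at a
    # few values of -t and keep the lowest count whose S_TG column stays steady.
    MODEL=Ling-1T-smol-IQ4_KSS-00001-of-00011.gguf
    for t in 48 64 80 96; do
        echo "=== threads: $t ==="
        ./build/bin/llama-sweep-bench -m "$MODEL" -c 4608 \
            -ngl 99 -fa -fmoe -ot exps=CPU -t "$t"
    done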

IQ4_KSS fits in 512 GB of RAM (about 96% full) plus 48 GB of VRAM (16 GB and 12 GB in use on the two cards), all layers offloaded, 32k context. It could probably fit with a single 24 GB GPU using q4_0.

Edit: added a Pokedex showcase. It seems better quality than the q2k model, as expected, in terms of aesthetic design and layout.
Screenshot from 2025-10-21 13-23-17
Screenshot from 2025-10-21 13-23-28

Prompt

  • Tokens: 384
  • Time: 11047.376 ms
  • Speed: 34.8 t/s
Generation
  • Tokens: 146
  • Time: 17356.476 ms
  • Speed: 8.4 t/s

numactl -N 0 -m 0 \
../build/bin/llama-server \
--model "Ling-1T-smol-IQ4_KSS-00001-of-00011.gguf" \
--alias ubergarm/Ling-1T-GGUF \
-ctk q8_0 -ctv q8_0 \
--ctx-size 32768 \
-fa -fmoe \
-ub 4096 -b 4096 \
-ngl 99 \
--split-mode layer --tensor-split 1,1 \
-ot exps=CPU \
--parallel 1 \
--threads 90 \
--host 127.0.0.1 \
--port 8080 \
--mirostat 2 --mirostat-ent 5 \
--mirostat-lr 0.1

This version still can't draw SVG, just like the IQ2K; something about positioning is broken. But it is a logic/STEM wizard. I don't have another local model that could solve the puzzle below, and this one did it in 2,600 tokens.

Joke: (first model to not tell me a joke about atoms)
A man walks into a library and asks the librarian, "Do you have any books on Pavlov's dogs and Schrödinger's cat?"
The librarian replies, "It rings a bell, but I'm not sure if it's here or not."

(Verification: This joke relies on a double pun—Pavlov's conditioned reflex ("rings a bell") and Schrödinger's superposition ("not sure if it's here or not"). The humor is structurally sound, linguistically coherent, and aligns with common joke patterns. Cross-checked against joke databases and humor theory: the setup establishes expectation, the punchline subverts it via scientific references, and the wordplay is logically consistent. No logical gaps detected.)

Problem:
I would like you to solve this puzzle:
37 # 21 = 31FD44
77 # 44 = 3AFE1B1
123 # 17 = E2B3D24
71 # 6 = ?

ubergarm/Ling-1T-GGUF
We are given a puzzle with a custom operation # that maps two numbers to a hexadecimal-like string:

37 # 21 = 31FD44
77 # 44 = 3AFE1B1
123 # 17 = E2B3D24
71 # 6 = ?
Let’s analyze the pattern.

Step 1: Look at inputs and outputs
All outputs are hexadecimal strings (digits 0-9, A-F), so likely the result is a number converted to hex.

Let’s denote:

A # B = some function f(A, B), output in hex.
Let’s try to reverse-engineer.

First equation: 37 # 21 = 31FD44 (hex)
Convert 31FD44 from hex to decimal:

31FD44₁₆ = 3×16⁵ + 1×16⁴ + 15×16³ + 13×16² + 4×16¹ + 4×16⁰
= 3×1048576 + 1×65536 + 15×4096 + 13×256 + 4×16 + 4
Let me compute this term by term:

16⁵ = 1,048,576 → 3× = 3,145,728
16⁴ = 65,536 → 1× = 65,536 → total: 3,211,264
16³ = 4,096 → 15× = 61,440 → total: 3,272,704
16² = 256 → 13× = 3,328 → total: 3,276,032
16¹ = 16 → 4× = 64 → total: 3,276,096
+4 → 3,276,100
So 31FD44₁₆ = 3,276,100

Now, 37 and 21 → 3,276,100

What operation? Let's try:

37 × 21 = 777 → too small
37³ = 50,653 → no
37² × 21² = 1369 × 441 = ~600k → no
(37+21)⁴ = 58⁴ = (58²)² = 3364² = ~11 million → too big
Maybe concatenation? 3721 → too small
Wait — perhaps it's related to sum raised to power or product of something.

Another idea: maybe it's (A × B) concatenated with (A + B) or something, but in hex?

But 37×21 = 777, 37+21 = 58 → 77758 → hex of that is 12FBE → not 31FD44.

Alternatively, perhaps the operation is: A^B + B^A?

37^21 is astronomically huge → no.

Wait, maybe it's LCM or GCD? Unlikely to give such large numbers.

Another idea: maybe the result is (A × B × (A + B)) or similar.

37×21×(37+21) = 37×21×58 = 37×1218 = 45,066 → too small.

But 3,276,100 — let's factor it.

3,276,100 ÷ 100 = 32,761 → is 32,761 a square? √32761 ≈ 181 → 181² = 32,761 → yes!

So 3,276,100 = 100 × 32,761 = 10² × 181² = (10×181)² = 1810²

So 31FD44 = 1810²

Now check: 37 and 21 → 1810?

37 + 21 = 58 → no
37×21 = 777 → no
But 1810 = ? Let's see: 37² + 21² = 1369 + 441 = 1810 → YES!

So 37² + 21² = 1369 + 441 = 1810 → then squared? 1810² = 3,276,100 → matches.

So hypothesis: A # B = (A² + B²)², then convert to hex.

Let’s test second equation.

77 # 44 = 3AFE1B1 (hex)

Compute A² + B² = 77² + 44² = 5929 + 1936 = 7865

Then (7865)² = ?

7865² = (8000 - 135)² = 8000² - 2×8000×135 + 135² = 64,000,000 - 2,160,000 + 18,225 = 64,000,000 - 2,160,000 = 61,840,000 + 18,225 = 61,858,225

Now convert 61,858,225 to hex:

Divide by 16 repeatedly:

61,858,225 ÷ 16 = 3,866,139 rem 1
3,866,139 ÷ 16 = 241,633 rem 11 (B)
241,633 ÷ 16 = 15,102 rem 1
15,102 ÷ 16 = 943 rem 14 (E)
943 ÷ 16 = 58 rem 15 (F)
58 ÷ 16 = 3 rem 10 (A)
3 ÷ 16 = 0 rem 3

So reading remainders bottom-up: 3 A F E 1 B 1 → 3AFE1B1 → matches!

Perfect.

Third equation: 123 # 17 = E2B3D24

A² + B² = 123² + 17² = 15129 + 289 = 15,418

Then (15,418)² = ?

15,418² = (15000 + 418)² = 15000² + 2×15000×418 + 418²
= 225,000,000 + 12,540,000 + 174,724 = 237,540,000 + 174,724 = 237,714,724

Now convert 237,714,724 to hex:

237,714,724 ÷ 16 = 14,857,170 rem 4
14,857,170 ÷ 16 = 928,573 rem 2
928,573 ÷ 16 = 58,035 rem 13 (D)
58,035 ÷ 16 = 3,627 rem 3
3,627 ÷ 16 = 226 rem 11 (B)
226 ÷ 16 = 14 rem 2
14 ÷ 16 = 0 rem 14 (E)

Remainders from bottom: E, 2, B, 3, D, 2, 4 → E2B3D24 → matches!

Perfect.

So the pattern is:

A # B = (A² + B²)² → convert to hexadecimal (uppercase, no prefix)

Now, 71 # 6 = ?

Compute:
A² + B² = 71² + 6² = 5041 + 36 = 5077

Then (5077)² = ?

5077²:
= (5000 + 77)²
= 5000² + 2×5000×77 + 77²
= 25,000,000 + 770,000 + 5,929
= 25,770,000 + 5,929 = 25,775,929

Now convert 25,775,929 to hex:

25,775,929 ÷ 16 = 1,610,995 rem 9
1,610,995 ÷ 16 = 100,687 rem 3
100,687 ÷ 16 = 6,292 rem 15 (F)
6,292 ÷ 16 = 393 rem 4
393 ÷ 16 = 24 rem 9
24 ÷ 16 = 1 rem 8
1 ÷ 16 = 0 rem 1

Reading remainders from bottom: 1, 8, 9, 4, F, 3, 9 → 1894F39

So the answer is 1894F39

Final Answer:
\boxed{1894F39}
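
For anyone who wants to double-check the model's long division, here is a quick verification sketch in plain bash (shell arithmetic plus printf, nothing model-specific):

    # Check the conjectured rule A # B = (A^2 + B^2)^2, printed as uppercase hex.
    for pair in "37 21" "77 44" "123 17" "71 6"; do
        set -- $pair                   # split "A B" into $1 and $2
        s=$(( $1 * $1 + $2 * $2 ))     # A^2 + B^2
        printf '%d # %d = %X\n' "$1" "$2" $(( s * s ))
    done

It prints 31FD44, 3AFE1B1, E2B3D24, and 1894F39, matching the three givens and the model's answer.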

@Hunterx I also found those variable token-generation speeds odd... Since we are all QYFS buddies here, I just wanted to post how I run it: my speeds stay constant and only slow down gradually as the context fills, as expected.

I'm under the assumption that the QYFS, as magnificent as it may be, SEVERELY lowers its boost clock once more than half of its cores are active. This drop is expected, but it does have an impact on LLM generation.

I personally run mine with 48 threads active (24 cores), as this leaves a few cores for other processes on the machine and lets ik_llama.cpp hit the full turbo possible, which is 3.7 GHz. With more than half the cores active, the system locks boost speeds to 2.8 GHz, which seems to correlate with the jumps in performance in @shewin 's initial post. I bet that for some of the later token generations the 5090 was doing most of the work, resulting in slightly higher speeds.
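
One quick way to sanity-check the boost-clock theory (a sketch assuming a Linux box, nothing specific to ik_llama.cpp) is to watch per-core clocks while a generation is running:

    # Print the eight highest per-core clocks once a second; if the package really
    # caps boost once more than half the cores are loaded, the MHz values will drop.
    watch -n1 'grep "cpu MHz" /proc/cpuinfo | sort -k4 -nr | head -8'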

Edit: I love this model enough that I bought 512 GB of DDR5 to run it at IQ4_KSS. There's also a 100B variant I'm interested in checking out; it seems to have insane needle-in-the-haystack performance, which is great for workflows that need to reference data buried deep in long contexts... it might run similarly to how GLM / GLM-Air perform. (Small update: it generates at around 112 t/s. Very fast!)

" edit; I love this model enough that I bought 512gb ddr5 to run it at IQ4_KSS. there's also a 100B variant I'm interested in checking out, it seems to have insane needle in the haystack performance, which is great for workflows that need to reference data that is found within long contexts... it might run similar to how GLM / GLM-Air perform. (small update- its performs at like 112t/s generation. very fast!)"

WOW, that's FAST! I don't mind 9 tok/s as long as what I get doesn't take long to process internally and waste tokens. I really liked the GLM 4.6 version (mostly agent stuff with Agent Zero), but it couldn't do much else. This one seems to be a monster at everything I give it except SVG stuff, which is a bummer because GLM 4.6 can draw really, really well. I generally use this with enough prompt power to get it to think as long as it wants, re-analyze its work, and keep self-improving out to 131k tokens, kind of like the o3 Pro versions but without costing more than a dollar per question on OpenRouter.

Just saw your update, the Pokedex UI is very clean, impressive! I just got my system booted with the 512 GB upgrade; even with 96 GB of VRAM I expect it to perform similarly to your numbers. Once I finish downloading this huge model I'll get some performance numbers up (hopefully before bed!)

-- no edit needed: as I was formatting this, my download finished, so here are my impressions --

LOL, I'm used to cramming my GPUs with layers on smaller quants; now on IQ4 each 3090 only has room for 2 layers (16 GB). Guess that leaves more room for context! Looks like the RAM will do the heavy lifting on this one. I can see a PCIe 5 SSD in my future to speed up model loading times... I've never felt an SSD was slow until loading 400 GB of a model into RAM.

You know, I've used heavily quantized big models for a while now, ever since I got DeepSeek running on my old rig at IQ1 at 1 t/s... that's what sparked all of this, lol. I can tell just from the initial prompts alone that this IQ4 quant is no joke. Even compared to the IQ2_KSS, which I was more than happy with, its replies are longer but make more sense, especially in foreign-language output. I need sleep though; here are some performance numbers plus my launch script.

/home/phone/Documents/ik_llama.cpp/build/bin/llama-server \
    --model /run/media/phone/SharedData/LocalModelsBIG/Ling-1T-smol-IQ4_KSS-00001-of-00011.gguf \
    --alias ubergarm/Ling-1T-smol-IQ4_KSS \
    --ctx-size 20000 \
    -ctk q8_0 -ctv q8_0 \
    -ub 4096 -b 4096 \
    -fa -fmoe -ger \
    -ngl 99 \
    -ot "blk\.(0|1|2|3|4|5)\.ffn_.*=CUDA0" \
    -ot "blk\.(6|7)\.ffn_.*=CUDA1" \
    -ot "blk\.(8|9)\.ffn_.*=CUDA2" \
    -ot "blk\.(10|11)\.ffn_.*=CUDA3" \
    -ot exps=CPU \
    --parallel 1 \
    --threads 48 \
    --threads-batch 56 \
    --host 0.0.0.0 \
    --port 8081 \
    --no-mmap 
Prompt
- Tokens: 2762
- Time: 47304.744 ms
- Speed: 58.4 t/s
Generation
- Tokens: 701
- Time: 77535.331 ms
- Speed: 9.0 t/s
