kernels-community

Team

community

AI & ML interests

None defined yet.

Recent Activity

danieldk

updated a model 2 days ago

kernels-community/quantization_gptq

Updated 2 days ago

danieldk

published a model 2 days ago

kernels-community/quantization_gptq

Updated 2 days ago

danieldk

updated a model 2 days ago

kernels-community/quantization_bitsandbytes

Updated 2 days ago

danieldk

published a model 2 days ago

kernels-community/quantization_bitsandbytes

Updated 2 days ago

ahadnagy

in kernels-community/megablocks 5 days ago

Update build.toml

#4 opened 5 days ago by ahadnagy

danieldk

updated a model 11 days ago

kernels-community/flash-attn2

Updated 11 days ago • 2.16k • 17

medmekk

updated a model 11 days ago

kernels-community/scattermoe

Updated 11 days ago

lewtun

updated a model 11 days ago

kernels-community/flash-attn2

Updated 11 days ago • 2.16k • 17

danieldk

updated a model 12 days ago

kernels-community/flash-attn3

Updated 12 days ago • 111k • 24

danieldk

updated a model 15 days ago

kernels-community/rmsnorm

Updated 15 days ago • 390

yonigozlan

updated a model 22 days ago

kernels-community/cv_utils

Updated 22 days ago • 320

drbh

updated a Space 26 days ago

Kernels Benchmarks

⏲

Manage and navigate through HTML content within an iframe

mfuntowicz

in kernels-community/rotary 28 days ago

Add Windows Kernel for PyTorch 2.9 + CUDA13

#5 opened 29 days ago by mfuntowicz

mfuntowicz

updated a model 29 days ago

kernels-community/rotary

Updated 28 days ago • 2.34k • 3

medmekk

updated a model 30 days ago

kernels-community/gpt-oss-metal-kernels

Updated 30 days ago • 19 • 2

medmekk

published a model 30 days ago

kernels-community/gpt-oss-metal-kernels

Updated 30 days ago • 19 • 2

danieldk

updated a model about 1 month ago

kernels-community/paged-attention

Updated Nov 3 • 2.2k • 5

nouamanetazi

posted an update about 1 month ago

Post

3944

After training 𝐒𝐦𝐨𝐥𝐋𝐌𝟑 on 𝟑𝟖𝟒 𝐇𝟏𝟎𝟎𝐬 for nearly a month, I've come to realize something most people overlook: 𝐢𝐧𝐟𝐫𝐚𝐬𝐭𝐫𝐮𝐜𝐭𝐮𝐫𝐞 𝐢𝐬 𝐭𝐡𝐞 𝐦𝐚𝐤𝐞-𝐨𝐫-𝐛𝐫𝐞𝐚𝐤 𝐟𝐚𝐜𝐭𝐨𝐫 𝐢𝐧 𝐋𝐋𝐌 𝐭𝐫𝐚𝐢𝐧𝐢𝐧𝐠. 🔥

Everyone talks about model architecture and data quality. And yes, those matter immensely. But here's what nobody tells you: when your training run fails at 2 AM because of mysterious 𝐍𝐂𝐂𝐋 𝐞𝐫𝐫𝐨𝐫𝐬, or when your expensive GPU cluster is running at 𝟔𝟎% 𝐞𝐟𝐟𝐢𝐜𝐢𝐞𝐧𝐜𝐲, the problem isn't your model. It's most probably a 𝐦𝐢𝐬𝐮𝐬𝐞 𝐨𝐟 𝐭𝐡𝐞 𝐡𝐚𝐫𝐝𝐰𝐚𝐫𝐞. 🛠️

Questions that seemed simple but had no clear answers: Why is 𝐌𝐨𝐄 𝐭𝐫𝐚𝐢𝐧𝐢𝐧𝐠 𝐬𝐥𝐨𝐰𝐞𝐫 𝐭𝐡𝐚𝐧 𝐝𝐞𝐧𝐬𝐞 𝐦𝐨𝐝𝐞𝐥𝐬? Which 𝐍𝐂𝐂𝐋 𝐟𝐥𝐚𝐠𝐬 should we actually set? How often should we checkpoint without killing throughput?

That's why we built 𝐓𝐡𝐞 𝐒𝐦𝐨𝐥 𝐓𝐫𝐚𝐢𝐧𝐢𝐧𝐠 𝐏𝐥𝐚𝐲𝐛𝐨𝐨𝐤 📖: a complete guide covering everything from model architecture and data curation to the SmolLM3 training marathon, post-training techniques, and crucially, the 𝐢𝐧𝐟𝐫𝐚𝐬𝐭𝐫𝐮𝐜𝐭𝐮𝐫𝐞 𝐥𝐚𝐲𝐞𝐫 that most teams get wrong.

We validated real vs theoretical bandwidth across the entire stack: 𝐇𝐁𝐌𝟑 𝐡𝐢𝐭𝐭𝐢𝐧𝐠 𝟑 𝐓𝐁/𝐬, 𝐍𝐕𝐋𝐢𝐧𝐤 𝟒.𝟎 𝐫𝐞𝐚𝐜𝐡𝐢𝐧𝐠 𝟕𝟖𝟔 𝐆𝐁/𝐬, 𝐏𝐂𝐈𝐞 𝐆𝐞𝐧𝟒 𝐚𝐭 𝟏𝟒.𝟐 𝐆𝐁/𝐬. Then we ran collective operations across 𝟏𝟐𝟖 𝐆𝐏𝐔𝐬 (16 nodes, 8xH100s each) and measured how performance degrades at scale: all-reduce drops from 𝟒𝟖𝟎 𝐆𝐁/𝐬 on a single node to 𝟑𝟐𝟎-𝟑𝟓𝟎 𝐆𝐁/𝐬 across 16 nodes.
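
As a rough worked example of the all-reduce figures quoted above, the sketch below computes how far the 16-node bandwidth falls below the single-node figure. The three bandwidth values are taken from the post. The helper name `relative_drop` is hypothetical and is not part of the playbook or any library.

```python
def relative_drop(baseline_gbps: float, measured_gbps: float) -> float:
    """Return the fraction of baseline bandwidth lost at the measured rate."""
    return (baseline_gbps - measured_gbps) / baseline_gbps


# Figures quoted in the post (GB/s).
single_node = 480.0                # all-reduce on a single node
multi_node_range = (320.0, 350.0)  # all-reduce across 16 nodes

# Hypothetical helper applied to both ends of the reported range.
drops = [relative_drop(single_node, m) for m in multi_node_range]

for measured, drop in zip(multi_node_range, drops):
    print(f"{measured:.0f} GB/s is {drop:.1%} below the single-node figure")
```

Under these numbers, the 16-node all-reduce runs roughly 27% to 33% below the single-node bandwidth.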

If you've ever wondered why your training runs are slower than they should be, or you're planning to scale up and want to avoid expensive mistakes, this guide might save you weeks of debugging.

𝐓𝐡𝐞 𝐒𝐦𝐨𝐥 𝐓𝐫𝐚𝐢𝐧𝐢𝐧𝐠 𝐏𝐥𝐚𝐲𝐛𝐨𝐨𝐤: https://lnkd.in/e5MKXUHS

Shared with ❤️ by the Hugging Face team

danieldk

posted an update about 2 months ago

Post

482

We have released kernel-builder 0.7.0: https://github.com/huggingface/kernel-builder/releases/tag/v0.7.0

Headline features:

* 🔮 Supports building kernels for the brand-new PyTorch 2.9.0.
* 🪟 Experimental support for building Windows kernels.

lysandre

posted an update 3 months ago

Post

7044

We're kick-starting the process of Transformers v5, with @ArthurZ and @cyrilvallez!

v5 should be significant: we're using it as a milestone for performance optimizations, saner defaults, and a much cleaner code base worthy of 2025.

Fun fact: v4.0.0-rc-1 came out on Nov 19, 2020, nearly five years ago!

6 replies

Team members 16