Qwerky Optimized Llama3.1 Mamba Hybrid - 8B Instruct

This is a hybrid Mamba-Transformer model based on the Llama 3.1 architecture, distilled from Llama 3.3 70B into an 8B-parameter model using Qwerky's proprietary distillation method. The model interleaves Mamba layers with attention layers for efficient sequence modeling. The result is an 8B-parameter model comparable in quality to Llama 3.1 8B while running as fast as or faster than Llama 3.2 3B.

Model Developer: Qwerky AI

⚠️ Important Requirements

CUDA is required to run this model; it cannot run on CPU-only systems. Make sure you have the following (a quick availability check is sketched after this list):

  • A CUDA-compatible GPU (NVIDIA GPU with CUDA support)
  • CUDA toolkit installed
  • PyTorch with CUDA support
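
If you want to verify the environment up front, a minimal check using PyTorch's built-in CUDA utilities (not part of this repository) looks like this:

import torch

# Fail fast if no CUDA-compatible GPU is visible to PyTorch
if not torch.cuda.is_available():
    raise RuntimeError("A CUDA-compatible GPU is required to run this model.")

print("CUDA device:", torch.cuda.get_device_name(0))
print("CUDA version used by PyTorch:", torch.version.cuda)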

Model Details

  • Model Type: QwerkyLlamaMambaHybrid (Hybrid Mamba-Transformer)
  • Architecture: QwerkyLlamaMambaHybridForCausalLM
  • Base Model: Llama-3.1-8B
  • Mamba Type: MAMBA

Model Configuration

  • Vocabulary Size: 128256
  • Hidden Size: 4096
  • Number of Layers: 32
  • Number of Attention Heads: 32
  • Intermediate Size: 14336

How to Use

This model can be loaded using HuggingFace Transformers with AutoTokenizer and AutoModelForCausalLM. The model uses custom configuration and modeling files that are automatically loaded via the auto_map in config.json.
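
To confirm that the custom classes resolve correctly before downloading the full weights, loading just the configuration is a quick, low-cost check. This is a minimal sketch using the standard AutoConfig API; the exact class name printed depends on the repository's auto_map:

from transformers import AutoConfig

# Resolve the custom configuration class through auto_map (requires trust_remote_code=True)
config = AutoConfig.from_pretrained(
    "QwerkyAI/Qwerky-Optimized-Llama3.1-Mamba-0.2-8B-Instruct",
    trust_remote_code=True,
)
print(type(config).__name__)  # should be the custom class from configuration_qwerky_llama_mamba_hybrid.py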

Installation

First, install the required dependencies:

pip install transformers torch safetensors
pip install flash-attn --no-build-isolation
pip install mamba-ssm --no-build-isolation
pip install "causal-conv1d>=1.2.0" --no-build-isolation

Note: flash-attn compilation can take 10-30 minutes and may use significant system resources. To avoid overwhelming your system, you can limit parallel compilation jobs:

MAX_JOBS=1 pip install flash-attn --no-build-isolation

Or set it as an environment variable:

export MAX_JOBS=1
pip install flash-attn --no-build-isolation
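
After installation, a simple import check confirms the CUDA kernels built correctly. This is a sketch; the import names are flash_attn, mamba_ssm, and causal_conv1d, and recent releases of each expose a __version__ attribute:

# Verify the compiled CUDA extensions import cleanly
import flash_attn
import mamba_ssm
import causal_conv1d

print("flash-attn:", flash_attn.__version__)
print("mamba-ssm:", mamba_ssm.__version__)
print("causal-conv1d:", causal_conv1d.__version__)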

Loading the Model

From HuggingFace Hub

import torch
from transformers import AutoTokenizer, AutoModelForCausalLM

# Load tokenizer and model
tokenizer = AutoTokenizer.from_pretrained("QwerkyAI/Qwerky-Optimized-Llama3.1-Mamba-0.2-8B-Instruct")
model = AutoModelForCausalLM.from_pretrained(
    "QwerkyAI/Qwerky-Optimized-Llama3.1-Mamba-0.2-8B-Instruct",
    torch_dtype=torch.bfloat16,  # or torch.float16
    device_map="auto",
    trust_remote_code=True
)  # device_map="auto" already places the model on the available GPU(s)

From Local Directory

import torch
from transformers import AutoTokenizer, AutoModelForCausalLM

# Load tokenizer and model from local directory
tokenizer = AutoTokenizer.from_pretrained("./path/to/model")
model = AutoModelForCausalLM.from_pretrained(
    "./path/to/model",
    torch_dtype=torch.bfloat16,  # or torch.float16
    device_map="auto",
    trust_remote_code=True
)  # device_map="auto" already places the model on the available GPU(s)

Generating Text

messages = [
    {"role": "user", "content": "Hello, how are you?"}
]

# Apply chat template
prompt = tokenizer.apply_chat_template(
    messages,
    tokenize=False,
    add_generation_prompt=True
)

# Tokenize and move to CUDA
inputs = tokenizer(prompt, return_tensors="pt").to("cuda")

# FlashAttention requires bfloat16 or float16; this cast is a no-op if the model was loaded in bfloat16 above
model = model.to(torch.bfloat16)

# Generate response (temperature only has an effect when sampling is enabled)
outputs = model.generate(
    **inputs,
    max_new_tokens=100,
    do_sample=True,
    temperature=0.7,
)

# Decode output
response = tokenizer.decode(outputs[0], skip_special_tokens=True)
print(response)
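
For interactive use, the same generate call can stream tokens to stdout as they are produced using Transformers' TextStreamer. The generation settings below are illustrative, not tuned for this model:

from transformers import TextStreamer

# Print tokens as they are generated instead of waiting for the full sequence
streamer = TextStreamer(tokenizer, skip_prompt=True, skip_special_tokens=True)

model.generate(
    **inputs,
    max_new_tokens=256,
    do_sample=True,
    temperature=0.7,
    streamer=streamer,
)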

Model Files

This model repository contains the following files (a sketch for downloading them locally follows the list):

  • config.json - Model configuration with auto_map for custom classes
  • modeling_qwerky_llama_mamba_hybrid.py - Custom modeling class
  • configuration_qwerky_llama_mamba_hybrid.py - Custom configuration class
  • model.safetensors or model-*.safetensors - Model weights (sharded if >5GB)
  • model.safetensors.index.json - Index file for sharded weights (if applicable)
  • tokenizer.json, tokenizer_config.json - Tokenizer files
  • README.md - This file
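
To fetch all of these files for the "From Local Directory" example above, one option is huggingface_hub's snapshot_download (this assumes huggingface_hub is installed; the cache location is up to you):

from huggingface_hub import snapshot_download

# Download the full repository and return the local path
local_dir = snapshot_download("QwerkyAI/Qwerky-Optimized-Llama3.1-Mamba-0.2-8B-Instruct")
print(local_dir)  # pass this path to from_pretrained in the "From Local Directory" example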

Requirements

  • Python 3.8+
  • PyTorch 2.0+
  • Transformers 4.30+
  • safetensors
  • mamba-ssm (for MAMBA models)
  • causal-conv1d>=1.2.0 (for MAMBA models)
  • flash-attn (for optimized attention)

Citation

If you use this model, please cite:

@misc{qwerky_llama_mamba_hybrid,
  title={QwerkyLlamaMambaHybrid},
  author={Qwerky AI, Inc.},
  year={2025},
  publisher={HuggingFace}
}

License

This model is licensed under the Qwerky Distilled Model License Agreement. See the LICENSE file for more details.
