Qwerky Optimized Llama3.1 Mamba Hybrid - 8B Instruct

This is a hybrid Mamba-Transformer model based on the Llama 3.1 architecture, distilled from Llama 3.3 70B into an 8B-parameter model using Qwerky's proprietary distillation method. The model interleaves Mamba layers with attention layers for efficient sequence modeling. The result is an 8B-parameter model comparable in quality to Llama 3.1 8B while running as fast as or faster than Llama 3.2 3B.

Model Developer: Qwerky AI

⚠️ Important Requirements

CUDA is required to run this model; it cannot run on CPU-only systems. Make sure you have the following (a quick availability check is sketched after this list):

  • A CUDA-compatible GPU (NVIDIA GPU with CUDA support)
  • CUDA toolkit installed
  • PyTorch with CUDA support
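
If you want to verify the environment up front, a minimal check using PyTorch's built-in CUDA utilities (not part of this repository) looks like this:

import torch

# Fail fast if no CUDA-compatible GPU is visible to PyTorch
if not torch.cuda.is_available():
    raise RuntimeError("A CUDA-compatible GPU is required to run this model.")

print("CUDA device:", torch.cuda.get_device_name(0))
print("CUDA version used by PyTorch:", torch.version.cuda)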

Model Details

  • Model Type: QwerkyLlamaMambaHybrid (Hybrid Mamba-Transformer)
  • Architecture: QwerkyLlamaMambaHybridForCausalLM
  • Base Model: Llama-3.1-8B
  • Mamba Type: MAMBA

Model Configuration

  • Vocabulary Size: 128256
  • Hidden Size: 4096
  • Number of Layers: 32
  • Number of Attention Heads: 32
  • Intermediate Size: 14336

How to Use

This model can be loaded using HuggingFace Transformers with AutoTokenizer and AutoModelForCausalLM. The model uses custom configuration and modeling files that are automatically loaded via the auto_map in config.json.
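
To confirm that the custom classes resolve correctly before downloading the full weights, loading just the configuration is a quick, low-cost check. This is a minimal sketch using the standard AutoConfig API; the exact class name printed depends on the repository's auto_map:

from transformers import AutoConfig

# Resolve the custom configuration class through auto_map (requires trust_remote_code=True)
config = AutoConfig.from_pretrained(
    "QwerkyAI/Qwerky-Optimized-Llama3.1-Mamba-0.2-8B-Instruct",
    trust_remote_code=True,
)
print(type(config).__name__)  # should be the custom class from configuration_qwerky_llama_mamba_hybrid.py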

Installation

First, install the required dependencies:

pip install transformers torch safetensors
pip install flash-attn --no-build-isolation
pip install mamba-ssm --no-build-isolation
pip install "causal-conv1d>=1.2.0" --no-build-isolation

Note: flash-attn compilation can take 10-30 minutes and may use significant system resources. To avoid overwhelming your system, you can limit parallel compilation jobs:

MAX_JOBS=1 pip install flash-attn --no-build-isolation

Or set it as an environment variable:

export MAX_JOBS=1
pip install flash-attn --no-build-isolation
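
After installation, a simple import check confirms the CUDA kernels built correctly. This is a sketch; the import names are flash_attn, mamba_ssm, and causal_conv1d, and recent releases of each expose a __version__ attribute:

# Verify the compiled CUDA extensions import cleanly
import flash_attn
import mamba_ssm
import causal_conv1d

print("flash-attn:", flash_attn.__version__)
print("mamba-ssm:", mamba_ssm.__version__)
print("causal-conv1d:", causal_conv1d.__version__)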

Loading the Model

From HuggingFace Hub

import torch
from transformers import AutoTokenizer, AutoModelForCausalLM

# Load tokenizer and model
tokenizer = AutoTokenizer.from_pretrained("QwerkyAI/Qwerky-Optimized-Llama3.1-Mamba-0.2-8B-Instruct")
model = AutoModelForCausalLM.from_pretrained(
    "QwerkyAI/Qwerky-Optimized-Llama3.1-Mamba-0.2-8B-Instruct",
    torch_dtype=torch.bfloat16,  # or torch.float16
    device_map="auto",
    trust_remote_code=True
)  # device_map="auto" already places the model on the available GPU(s)

From Local Directory

import torch
from transformers import AutoTokenizer, AutoModelForCausalLM

# Load tokenizer and model from local directory
tokenizer = AutoTokenizer.from_pretrained("./path/to/model")
model = AutoModelForCausalLM.from_pretrained(
    "./path/to/model",
    torch_dtype=torch.bfloat16,  # or torch.float16
    device_map="auto",
    trust_remote_code=True
)  # device_map="auto" already places the model on the available GPU(s)

Generating Text

messages = [
    {"role": "user", "content": "Hello, how are you?"}
]

# Apply chat template
prompt = tokenizer.apply_chat_template(
    messages,
    tokenize=False,
    add_generation_prompt=True
)

# Tokenize and move to CUDA
inputs = tokenizer(prompt, return_tensors="pt").to("cuda")

# FlashAttention requires bfloat16 or float16; this cast is a no-op if the model was loaded in bfloat16 above
model = model.to(torch.bfloat16)

# Generate response (temperature only has an effect when sampling is enabled)
outputs = model.generate(
    **inputs,
    max_new_tokens=100,
    do_sample=True,
    temperature=0.7,
)

# Decode output
response = tokenizer.decode(outputs[0], skip_special_tokens=True)
print(response)
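
For interactive use, the same generate call can stream tokens to stdout as they are produced using Transformers' TextStreamer. The generation settings below are illustrative, not tuned for this model:

from transformers import TextStreamer

# Print tokens as they are generated instead of waiting for the full sequence
streamer = TextStreamer(tokenizer, skip_prompt=True, skip_special_tokens=True)

model.generate(
    **inputs,
    max_new_tokens=256,
    do_sample=True,
    temperature=0.7,
    streamer=streamer,
)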

Model Files

This model repository contains the following files (a sketch for downloading them locally follows the list):

  • config.json - Model configuration with auto_map for custom classes
  • modeling_qwerky_llama_mamba_hybrid.py - Custom modeling class
  • configuration_qwerky_llama_mamba_hybrid.py - Custom configuration class
  • model.safetensors or model-*.safetensors - Model weights (sharded if >5GB)
  • model.safetensors.index.json - Index file for sharded weights (if applicable)
  • tokenizer.json, tokenizer_config.json - Tokenizer files
  • README.md - This file
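
To fetch all of these files for the "From Local Directory" example above, one option is huggingface_hub's snapshot_download (this assumes huggingface_hub is installed; the cache location is up to you):

from huggingface_hub import snapshot_download

# Download the full repository and return the local path
local_dir = snapshot_download("QwerkyAI/Qwerky-Optimized-Llama3.1-Mamba-0.2-8B-Instruct")
print(local_dir)  # pass this path to from_pretrained in the "From Local Directory" example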

Requirements

  • Python 3.8+
  • PyTorch 2.0+
  • Transformers 4.30+
  • safetensors
  • mamba-ssm (for MAMBA models)
  • causal-conv1d>=1.2.0 (for MAMBA models)
  • flash-attn (for optimized attention)

Citation

If you use this model, please cite:

@misc{qwerky_llama_mamba_hybrid,
  title={QwerkyLlamaMambaHybrid},
  author={Qwerky AI, Inc.},
  year={2025},
  publisher={HuggingFace}
}

License

This model is licensed under the Qwerky Distilled Model License Agreement. See the LICENSE file for more details.
