Tonic committed
Commit 2ee7774 · 1 Parent(s): 1a6008e

adds torchao

Files changed (4)
  1. README_TORCHAO.md +172 -0
  2. app.py +39 -15
  3. requirements.txt +1 -1
  4. test_torchao_inference.py +95 -0
README_TORCHAO.md ADDED
@@ -0,0 +1,172 @@
+ # TorchAO Quantization Implementation
+
+ This project now uses **TorchAO** for proper quantization and inference. TorchAO is PyTorch's architecture optimization library that provides high-performance quantization techniques.
+
+ ## Key Changes Made
+
+ ### 1. Proper TorchAO Configuration
+
+ The app now uses the correct TorchAO quantization configurations:
+
+ ```python
+ from transformers import TorchAoConfig
+ from torchao.quantization import Int4WeightOnlyConfig, Int8WeightOnlyConfig
+ from torchao.dtypes import Int4CPULayout
+
+ def get_quantization_config():
+     if DEVICE == "cuda":
+         # For CUDA, use Int8WeightOnlyConfig for better performance
+         quant_config = Int8WeightOnlyConfig(group_size=128)
+     else:
+         # For CPU, use Int4WeightOnlyConfig with CPU layout
+         quant_config = Int4WeightOnlyConfig(group_size=128, layout=Int4CPULayout())
+
+     return TorchAoConfig(quant_type=quant_config)
+ ```
+
+ ### 2. Correct Model Loading
+
+ The model is now loaded with proper TorchAO quantization:
+
+ ```python
+ quantization_config = get_quantization_config()
+ model = AutoModelForCausalLM.from_pretrained(
+     model_id,
+     quantization_config=quantization_config,
+     device_map="auto" if device == "cuda" else "cpu",
+     torch_dtype=torch.bfloat16 if device == "cuda" else torch.float32,
+     trust_remote_code=True,
+     low_cpu_mem_usage=True,
+ )
+ ```
+
+ ### 3. Proper Inference with Cache Implementation
+
+ The most important fix is using `cache_implementation="static"` for generation:
+
+ ```python
+ output_ids = model.generate(
+     inputs['input_ids'],
+     max_new_tokens=max_tokens,
+     temperature=temperature,
+     top_p=top_p,
+     do_sample=do_sample,
+     attention_mask=inputs['attention_mask'],
+     pad_token_id=tokenizer.eos_token_id,
+     eos_token_id=tokenizer.eos_token_id,
+     cache_implementation="static"  # CRITICAL for TorchAO quantized models
+ )
+ ```
+
+ ## TorchAO Quantization Types
+
+ ### For CUDA (GPU)
+ - **Int8WeightOnlyConfig**: Best performance for most use cases
+ - **Int8DynamicActivationInt8WeightConfig**: For more aggressive quantization
+ - **GemliteUIntXWeightOnlyConfig**: Optimized for H100/A100 GPUs
+
+ ### For CPU
+ - **Int4WeightOnlyConfig with Int4CPULayout**: Optimized for CPU deployment
+ - **Int8WeightOnlyConfig**: Alternative for better compatibility
+
+ ### For Sparsity (Advanced)
+ - **Int4WeightOnlyConfig with MarlinSparseLayout**: For 2:4 sparsity (see the selection sketch below)
+
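+ As a rough illustration of selecting among these options, here is a small sketch (not code from the app; exact class names and signatures can vary across torchao versions):
+
+ ```python
+ from transformers import TorchAoConfig
+ from torchao.quantization import (
+     Int4WeightOnlyConfig,
+     Int8DynamicActivationInt8WeightConfig,
+ )
+ from torchao.dtypes import Int4CPULayout, MarlinSparseLayout
+
+ def pick_config(device: str, sparse: bool = False) -> TorchAoConfig:
+     """Hypothetical helper mirroring get_quantization_config() in app.py."""
+     if sparse:
+         # 2:4 sparsity variant; requires a checkpoint prepared for sparse Marlin kernels
+         return TorchAoConfig(quant_type=Int4WeightOnlyConfig(layout=MarlinSparseLayout()))
+     if device == "cuda":
+         # The more aggressive CUDA option from the list above
+         return TorchAoConfig(quant_type=Int8DynamicActivationInt8WeightConfig())
+     # CPU default, matching the app's configuration
+     return TorchAoConfig(quant_type=Int4WeightOnlyConfig(group_size=128, layout=Int4CPULayout()))
+ ```
+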
+ ## Testing the Implementation
+
+ Run the test script to verify that TorchAO quantization is working:
+
+ ```bash
+ python test_torchao_inference.py
+ ```
+
+ This will test:
+ - Model loading with TorchAO quantization
+ - Text generation with the proper cache implementation
+ - Memory usage optimization
+
+ ## Performance Benefits
+
+ 1. **Memory Reduction**: Up to ~50% lower memory use with Int4 quantization
+ 2. **Faster Inference**: Optimized kernels for quantized operations
+ 3. **Better Compatibility**: Works with `torch.compile` for additional speedup (see the sketch below)
+ 4. **Device Optimization**: Different configs for CUDA vs CPU
+
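+ For the `torch.compile` point above, a minimal sketch (assuming `model` and `inputs` are the quantized model and tokenized inputs from the examples above; compile time and speedups vary by hardware and PyTorch version):
+
+ ```python
+ import torch
+
+ # Compile the forward pass once; the first generate() call pays the compilation cost.
+ model.forward = torch.compile(model.forward, mode="max-autotune", fullgraph=True)
+
+ # Pair compilation with the static cache for best results.
+ output_ids = model.generate(**inputs, max_new_tokens=64, cache_implementation="static")
+ ```
+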
+ ## Common Issues and Solutions
+
+ ### Issue: Model outputs incorrect or garbled text
+ **Solution**: Ensure `cache_implementation="static"` is used in generation
+
+ ### Issue: Memory errors during loading
+ **Solution**: Use the appropriate quantization config for your device (Int4 for CPU, Int8 for CUDA)
+
+ ### Issue: Slow inference
+ **Solution**:
+ 1. Use `cache_implementation="static"`
+ 2. Consider using `torch.compile` for additional speedup
+ 3. Use an appropriate group_size (128 is usually optimal)
+
+ ## Advanced Configuration
+
+ ### Per-Module Quantization
+
+ You can quantize different layers with different configs:
+
+ ```python
+ from torchao.quantization import ModuleFqnToConfig, Int4WeightOnlyConfig
+
+ # Skip quantization for certain layers
+ config = ModuleFqnToConfig({
+     "_default": Int4WeightOnlyConfig(group_size=128),
+     "model.layers.0.self_attn.q_proj": None,  # Skip this layer
+ })
+ ```
+
+ ### Autoquantization
+
+ For automatic quantization selection:
+
+ ```python
+ quantization_config = TorchAoConfig("autoquant", min_sqnr=None)
+ # After loading, call:
+ model.finalize_autoquant()
+ ```
+
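+ Putting the autoquant pieces together, a sketch of the full flow (the model id is the one used in `test_torchao_inference.py`; the exact warm-up required before `finalize_autoquant()` may differ between versions):
+
+ ```python
+ import torch
+ from transformers import AutoModelForCausalLM, AutoTokenizer, TorchAoConfig
+
+ model_id = "Tonic/petite-elle-L-aime-3-sft"
+ quantization_config = TorchAoConfig("autoquant", min_sqnr=None)
+
+ model = AutoModelForCausalLM.from_pretrained(
+     model_id,
+     quantization_config=quantization_config,
+     torch_dtype=torch.bfloat16,
+     device_map="auto",
+ )
+ tokenizer = AutoTokenizer.from_pretrained(model_id)
+
+ # Run at least one generation so autoquant can benchmark kernel choices,
+ # then finalize to lock in the per-layer quantization decisions.
+ inputs = tokenizer("Bonjour, comment allez-vous?", return_tensors="pt").to(model.device)
+ _ = model.generate(**inputs, max_new_tokens=8, cache_implementation="static")
+ model.finalize_autoquant()
+ ```
+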
+ ## Requirements
+
+ Make sure you have a recent TorchAO version:
+
+ ```bash
+ pip install "torchao>=0.10.0"
+ ```
+
+ ## Deployment Notes
+
+ 1. **Serialization**: TorchAO models should be saved with `safe_serialization=False` (example below)
+ 2. **Device Compatibility**: Int4 models are device-specific; Int8 models are portable
+ 3. **Memory**: Monitor memory usage during deployment
+ 4. **Performance**: Use `cache_implementation="static"` for best performance
+
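+ For the serialization note above, a minimal sketch (assuming `model` and `tokenizer` are the quantized objects loaded earlier; the output directory is an arbitrary placeholder):
+
+ ```python
+ # TorchAO weights are tensor subclasses, so skip safetensors serialization when saving.
+ output_dir = "petite-elle-torchao"  # placeholder path
+ model.save_pretrained(output_dir, safe_serialization=False)
+ tokenizer.save_pretrained(output_dir)
+
+ # Reload later; for Int4 CPU layouts, load on the same device type used for quantization.
+ # model = AutoModelForCausalLM.from_pretrained(output_dir, device_map="cpu", torch_dtype="auto")
+ ```
+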
150
+ ## Troubleshooting
151
+
152
+ ### Check TorchAO Version
153
+ ```python
154
+ import torchao
155
+ print(torchao.__version__)
156
+ ```
157
+
158
+ ### Verify Quantization
159
+ ```python
160
+ # Check if model is quantized
161
+ for name, module in model.named_modules():
162
+ if hasattr(module, 'weight') and module.weight.dtype != torch.float32:
163
+ print(f"{name}: {module.weight.dtype}")
164
+ ```
165
+
166
+ ### Memory Usage
167
+ ```python
168
+ import torch
169
+ print(f"GPU Memory: {torch.cuda.memory_allocated() / 1024**3:.2f} GB")
170
+ ```
171
+
172
+ This implementation ensures proper TorchAO quantization for both loading and inference, with the critical `cache_implementation="static"` parameter for correct generation.
app.py CHANGED
@@ -1,6 +1,8 @@
  import gradio as gr
  import torch
- from transformers import AutoModelForCausalLM, AutoTokenizer
+ from transformers import AutoModelForCausalLM, AutoTokenizer, TorchAoConfig
+ from torchao.quantization import Int4WeightOnlyConfig, Int8WeightOnlyConfig, Int8DynamicActivationInt8WeightConfig
+ from torchao.dtypes import Int4CPULayout
  import re
  import json
  from typing import List, Dict, Any, Optional
@@ -9,6 +11,7 @@ import spaces
  import os
  import sys
  import requests
+ import accelerate

  # Set torch to use float32 for better compatibility with quantized models
  torch.set_default_dtype(torch.float32)
@@ -20,20 +23,20 @@ model = None
  tokenizer = None
  DEFAULT_SYSTEM_PROMPT = "Tu es TonicIA, un assistant francophone rigoureux et bienveillant."
  title = "# 🤖 Petite Elle L'Aime 3 - Chat Interface"
- description = "A fine-tuned version of SmolLM3-3B optimized for French conversations. This is the int4 quantized version for efficient CPU deployment."
+ description = "A fine-tuned version of SmolLM3-3B optimized for French conversations. This is the torchao quantized version for efficient deployment."
  presentation1 = """
  ### 🎯 Features
  - **Multilingual Support**: English, French, Italian, Portuguese, Chinese, Arabic
- - **Int4 Quantization**: Optimized for CPU deployment with ~50% memory reduction
+ - **TorchAO Quantization**: Optimized for deployment with memory reduction
  - **Interactive Chat Interface**: Real-time conversation with the model
  - **Customizable System Prompt**: Define the assistant's personality and behavior
  - **Thinking Mode**: Enable reasoning mode with thinking tags
  """
  presentation2 = """### 🎯 Fonctionnalités
  * **Support multilingue** : Anglais, Français, Italien, Portugais, Chinois, Arabe
- * **Quantification Int4** : Optimisé pour un déploiement sur CPU avec une réduction de mémoire d’environ 50 %
+ * **Quantification TorchAO** : Optimisé pour un déploiement avec réduction de mémoire
  * **Interface de chat interactive** : Conversation en temps réel avec le modèle
- * **Invite système personnalisable** : Définissez la personnalité et le comportement de l’assistant
+ * **Invite système personnalisable** : Définissez la personnalité et le comportement de l'assistant
  * **Mode Réflexion** : Activez le mode raisonnement avec des balises de réflexion
  """
  joinus = """
@@ -63,23 +66,42 @@ def download_chat_template():
          return None


+ def get_quantization_config():
+     """Get the appropriate quantization config based on device"""
+     if DEVICE == "cuda":
+         # For CUDA, use Int8WeightOnlyConfig for better performance
+         quant_config = Int8WeightOnlyConfig(group_size=128)
+     else:
+         # For CPU, use Int4WeightOnlyConfig with CPU layout
+         quant_config = Int4WeightOnlyConfig(group_size=128, layout=Int4CPULayout())
+
+     return TorchAoConfig(quant_type=quant_config)
+
+
  def load_model():
-     """Load the model and tokenizer"""
+     """Load the model and tokenizer with torchao quantization"""
      global model, tokenizer

      try:
          logger.info(f"Loading tokenizer from {MAIN_MODEL_ID}")
          tokenizer = AutoTokenizer.from_pretrained(MAIN_MODEL_ID, subfolder="int4")
          chat_template = download_chat_template()
-         tokenizer.chat_template = chat_template
+         if chat_template:
+             tokenizer.chat_template = chat_template
          logger.info("Chat template downloaded and set successfully")

-         logger.info(f"Loading int4 model from {MAIN_MODEL_ID}")
+         logger.info(f"Loading model with torchao quantization from {MAIN_MODEL_ID}")
+
+         # Get quantization config
+         quantization_config = get_quantization_config()
+         logger.info(f"Using quantization config: {quantization_config}")
+
          model_kwargs = {
              "device_map": "auto" if DEVICE == "cuda" else "cpu",
-             "torch_dtype": torch.float32,
+             "torch_dtype": torch.bfloat16 if DEVICE == "cuda" else torch.float32,
              "trust_remote_code": True,
              "low_cpu_mem_usage": True,
+             "quantization_config": quantization_config,
          }

          logger.info(f"Model loading parameters: {model_kwargs}")
@@ -88,7 +110,7 @@ def load_model():
          if tokenizer.pad_token_id is None:
              tokenizer.pad_token_id = tokenizer.eos_token_id

-         logger.info("Model loaded successfully")
+         logger.info("Model loaded successfully with torchao quantization")
          return True

      except Exception as e:
@@ -121,11 +143,12 @@ def create_prompt(system_message, user_message, enable_thinking=True):

  @spaces.GPU(duration=94)
  def generate_response(message, history, system_message, max_tokens, temperature, top_p, do_sample, enable_thinking=True):
-     """Generate response using the model"""
+     """Generate response using the torchao quantized model"""
      global model, tokenizer

      if model is None or tokenizer is None:
          return "Error: Model not loaded. Please wait for the model to load."
+
      full_prompt = create_prompt(system_message, message, enable_thinking)

      if not full_prompt:
@@ -136,6 +159,7 @@ def generate_response(message, history, system_message, max_tokens, temperature,

      if DEVICE == "cuda":
          inputs = {k: v.cuda() for k, v in inputs.items()}
+
      with torch.no_grad():
          output_ids = model.generate(
              inputs['input_ids'],
@@ -145,8 +169,9 @@ def generate_response(message, history, system_message, max_tokens, temperature,
              do_sample=do_sample,
              attention_mask=inputs['attention_mask'],
              pad_token_id=tokenizer.eos_token_id,
-             eos_token_id=tokenizer.eos_token_id
-         )
+             eos_token_id=tokenizer.eos_token_id,
+             cache_implementation="static"  # Important for torchao quantized models
+         )
      response = tokenizer.decode(output_ids[0], skip_special_tokens=True)
      assistant_response = response[len(full_prompt):].strip()
      assistant_response = re.sub(r'<\|im_start\|>.*?<\|im_end\|>', '', assistant_response, flags=re.DOTALL)
@@ -175,7 +200,7 @@ def bot(history, system_prompt, max_length, temperature, top_p, advanced_checkbo
      return history

  # Load model on startup
- logger.info("Starting model loading process...")
+ logger.info("Starting model loading process with torchao quantization...")
  load_model()

  # Create Gradio interface
@@ -259,6 +284,5 @@
      )

  if __name__ == "__main__":
-
      demo.queue()
      demo.launch(ssr_mode=False, mcp_server=True)
requirements.txt CHANGED
@@ -2,7 +2,7 @@ gradio>=5.38.2
  torch>=2.0.0
  transformers>=4.54.0
  accelerate>=0.20.0
- torchao>=0.1.0
+ torchao>=0.10.0
  safetensors>=0.4.0
  tokenizers>=0.21.2
  pyyaml>=6.0
test_torchao_inference.py ADDED
@@ -0,0 +1,95 @@
+ #!/usr/bin/env python3
+ """
+ Test script for torchao quantization inference
+ """
+
+ import torch
+ from transformers import AutoModelForCausalLM, AutoTokenizer, TorchAoConfig
+ from torchao.quantization import Int4WeightOnlyConfig, Int8WeightOnlyConfig
+ from torchao.dtypes import Int4CPULayout
+ import logging
+
+ # Set up logging
+ logging.basicConfig(level=logging.INFO)
+ logger = logging.getLogger(__name__)
+
+ def test_torchao_quantization():
+     """Test torchao quantization with different configurations"""
+
+     model_id = "Tonic/petite-elle-L-aime-3-sft"
+     device = "cuda" if torch.cuda.is_available() else "cpu"
+
+     logger.info(f"Testing torchao quantization on device: {device}")
+
+     # Test different quantization configs
+     configs_to_test = []
+
+     if device == "cuda":
+         configs_to_test.append(("Int8WeightOnlyConfig", Int8WeightOnlyConfig(group_size=128)))
+     else:
+         configs_to_test.append(("Int4WeightOnlyConfig CPU", Int4WeightOnlyConfig(group_size=128, layout=Int4CPULayout())))
+
+     for config_name, quant_config in configs_to_test:
+         logger.info(f"\nTesting {config_name}...")
+
+         try:
+             # Create quantization config
+             quantization_config = TorchAoConfig(quant_type=quant_config)
+
+             # Load tokenizer
+             tokenizer = AutoTokenizer.from_pretrained(model_id)
+             if tokenizer.pad_token_id is None:
+                 tokenizer.pad_token_id = tokenizer.eos_token_id
+
+             # Load model with quantization
+             model_kwargs = {
+                 "device_map": "auto" if device == "cuda" else "cpu",
+                 "torch_dtype": torch.bfloat16 if device == "cuda" else torch.float32,
+                 "trust_remote_code": True,
+                 "low_cpu_mem_usage": True,
+                 "quantization_config": quantization_config,
+             }
+
+             logger.info(f"Loading model with {config_name}...")
+             model = AutoModelForCausalLM.from_pretrained(model_id, **model_kwargs)
+
+             # Test generation
+             test_prompt = "Bonjour, comment allez-vous?"
+             inputs = tokenizer(test_prompt, return_tensors="pt")
+
+             if device == "cuda":
+                 inputs = {k: v.cuda() for k, v in inputs.items()}
+
+             logger.info("Generating response...")
+             with torch.no_grad():
+                 output_ids = model.generate(
+                     inputs['input_ids'],
+                     max_new_tokens=50,
+                     temperature=0.7,
+                     top_p=0.95,
+                     do_sample=True,
+                     attention_mask=inputs['attention_mask'],
+                     pad_token_id=tokenizer.eos_token_id,
+                     eos_token_id=tokenizer.eos_token_id,
+                     cache_implementation="static"  # Important for torchao
+                 )
+
+             response = tokenizer.decode(output_ids[0], skip_special_tokens=True)
+             assistant_response = response[len(test_prompt):].strip()
+
+             logger.info(f"✅ {config_name} test successful!")
+             logger.info(f"Input: {test_prompt}")
+             logger.info(f"Output: {assistant_response}")
+
+             # Clean up
+             del model
+             torch.cuda.empty_cache() if device == "cuda" else None
+
+         except Exception as e:
+             logger.error(f"❌ {config_name} test failed: {e}")
+             continue
+
+     logger.info("\n🎉 All torchao quantization tests completed!")
+
+ if __name__ == "__main__":
+     test_torchao_quantization()