
# SmolLM3 Features Implementation

This document describes the SmolLM3 features implemented in the Petite Elle L'Aime 3 chat interface.

## 🎯 SmolLM3 Features

### 1. Thinking Mode

SmolLM3 supports extended thinking mode with reasoning traces. The implementation includes:

- **Automatic thinking flags**: system prompts automatically receive a `/think` or `/no_think` flag
- **Manual control**: users can add the flags to system prompts themselves
- **UI toggle**: a checkbox enables or disables thinking mode
- **Response cleaning**: thinking tags are stripped from responses (see the sketch after the usage examples below)

**Usage examples:**

```python
# With thinking enabled (default)
system_prompt = "Tu es TonicIA, un assistant francophone rigoureux et bienveillant./think"

# With thinking disabled
system_prompt = "Tu es TonicIA, un assistant francophone rigoureux et bienveillant./no_think"

# Manual control in the UI
enable_thinking = True  # or False
```
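
The response cleaning mentioned above can be a simple regex strip. A minimal sketch, assuming SmolLM3's `<think>...</think>` trace format (the helper name is illustrative, not the app's actual function):

```python
import re

def clean_thinking(text: str) -> str:
    # Remove <think>...</think> reasoning traces from the raw model output
    return re.sub(r"<think>.*?</think>", "", text, flags=re.DOTALL).strip()
```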

### 2. Tool Calling

SmolLM3 supports both XML and Python tool calling formats:

#### XML Tools (Default)

```json
[
  {
    "name": "get_weather",
    "description": "Get the weather in a city",
    "parameters": {
      "type": "object",
      "properties": {
        "city": {
          "type": "string",
          "description": "The city to get the weather for"
        }
      }
    }
  }
]
```

#### Python Tools

```python
# Tools are called as Python functions inside <code> tags
# Example: <code>get_weather(city="Paris")</code>
```
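
Rather than writing tool schemas by hand, one option is to derive them from annotated Python functions with transformers' `get_json_schema` helper. A sketch (the placeholder function body is an assumption for demonstration):

```python
from transformers.utils import get_json_schema

def get_weather(city: str) -> str:
    """
    Get the weather in a city.

    Args:
        city: The city to get the weather for
    """
    return f"Sunny in {city}"  # placeholder body for demonstration

# Derives a JSON schema from the signature and docstring; the result can be
# passed to the chat template as python_tools (or xml_tools)
weather_tool = get_json_schema(get_weather)
```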

### 3. Generation Parameters

Following SmolLM3 recommendations:

- **Temperature**: 0.6 (recommended default)
- **Top-p**: 0.95 (recommended default)
- **Repetition penalty**: 1.1 (recommended default)
- **Max tokens**: 2048 (configurable up to 32,768)
- **Context length**: up to 65,536 tokens (extensible to 128k or 256k with YaRN)
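
A minimal generation call with these defaults might look like the following (the checkpoint name is assumed; adjust to the model actually deployed):

```python
from transformers import AutoModelForCausalLM, AutoTokenizer

model_id = "HuggingFaceTB/SmolLM3-3B"  # assumed checkpoint
tokenizer = AutoTokenizer.from_pretrained(model_id)
model = AutoModelForCausalLM.from_pretrained(model_id)

inputs = tokenizer("Bonjour !", return_tensors="pt")
outputs = model.generate(
    **inputs,
    do_sample=True,
    temperature=0.6,         # recommended default
    top_p=0.95,              # recommended default
    repetition_penalty=1.1,  # recommended default
    max_new_tokens=2048,
)
print(tokenizer.decode(outputs[0], skip_special_tokens=True))
```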

### 4. Long Context Processing

The model supports:

- **Base context**: 65,536 tokens
- **Extended context**: up to 256k tokens with YaRN scaling (the factor-2.0 config below doubles the base window to 131,072 tokens, i.e. 128k; a factor of 4.0 is needed to reach 256k)
- **YaRN configuration**: available for longer inputs:
```json
{
  "rope_scaling": {
    "factor": 2.0,
    "original_max_position_embeddings": 65536,
    "type": "yarn"
  }
}
```
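
One way to apply this configuration is to override `rope_scaling` when loading the model; a sketch (checkpoint name assumed as above):

```python
from transformers import AutoModelForCausalLM

# Overriding rope_scaling at load time enables YaRN context extension
model = AutoModelForCausalLM.from_pretrained(
    "HuggingFaceTB/SmolLM3-3B",  # assumed checkpoint
    rope_scaling={
        "factor": 2.0,
        "original_max_position_embeddings": 65536,
        "type": "yarn",
    },
)
```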

## 🔧 Implementation Details

### Chat Template Integration

The app applies SmolLM3's chat template with proper handling of thinking flags and tool calling:

```python
def create_prompt(system_message, user_message, enable_thinking=True, tools=None, use_xml_tools=True):
    formatted_messages = []

    # Handle thinking flags
    if system_message and system_message.strip():
        has_think_flag = "/think" in system_message
        has_no_think_flag = "/no_think" in system_message

        if not enable_thinking and not has_no_think_flag:
            system_message += "/no_think"
        elif enable_thinking and not has_think_flag and not has_no_think_flag:
            system_message += "/think"
        formatted_messages.append({"role": "system", "content": system_message})

    formatted_messages.append({"role": "user", "content": user_message})

    # Apply chat template with SmolLM3 features
    template_kwargs = {
        "tokenize": False,
        "add_generation_prompt": True,
        "enable_thinking": enable_thinking
    }

    # Add tool calling
    if tools and len(tools) > 0:
        if use_xml_tools:
            template_kwargs["xml_tools"] = tools
        else:
            template_kwargs["python_tools"] = tools

    return tokenizer.apply_chat_template(formatted_messages, **template_kwargs)
```
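
Calling it is straightforward; for example:

```python
prompt = create_prompt(
    system_message="Tu es TonicIA, un assistant francophone rigoureux et bienveillant.",
    user_message="Explique-moi la gravité en termes simples.",
    enable_thinking=True,
)
```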

### Tool Call Detection

The app detects and formats tool calls in responses:

```python
import re

# Handle tool calls if present (assistant_response and parsed_tools are
# defined earlier in the app's generation pipeline)
if parsed_tools and ("<tool_call>" in assistant_response or "<code>" in assistant_response):
    if "<tool_call>" in assistant_response:
        tool_call_match = re.search(r'<tool_call>(.*?)</tool_call>', assistant_response, re.DOTALL)
        if tool_call_match:
            tool_call = tool_call_match.group(1)
            assistant_response += f"\n\n🔧 Tool Call Detected: {tool_call}\n\nNote: This is a simulated tool call."
    elif "<code>" in assistant_response:
        code_match = re.search(r'<code>(.*?)</code>', assistant_response, re.DOTALL)
        if code_match:
            code_call = code_match.group(1)
            assistant_response += f"\n\n🐍 Python Tool Call: {code_call}\n\nNote: This is a simulated Python tool call."
```
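
To go beyond annotating the response, the detected payload could be dispatched to a local implementation. A sketch, assuming the `<tool_call>` payload is JSON with `name` and `arguments` keys (the exact payload format is not shown above, so treat this as an assumption):

```python
import json

# Hypothetical registry mapping tool names to local implementations
TOOL_REGISTRY = {
    "get_weather": lambda city: f"Sunny in {city}",
}

def run_tool_call(tool_call: str) -> str:
    # Parse the captured payload and dispatch to the matching function
    payload = json.loads(tool_call)
    func = TOOL_REGISTRY[payload["name"]]
    return func(**payload.get("arguments", {}))
```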

## 🎮 UI Features

### Advanced Settings Panel

- **Temperature slider**: 0.01 to 1.0 (default: 0.6)
- **Top-p slider**: 0.1 to 1.0 (default: 0.95)
- **Repetition penalty slider**: 1.0 to 2.0 (default: 1.1)
- **Max length slider**: 10 to 32,768 tokens (default: 2048)
- **Thinking mode checkbox**: enable or disable reasoning traces
- **Tool calling checkbox**: enable or disable function calling
- **XML vs. Python tools**: choose the tool calling format
- **Tool definition editor**: JSON editor for custom tools
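
A minimal sketch of such a panel in Gradio (widget names and layout are illustrative, not the app's exact code):

```python
import gradio as gr

with gr.Blocks() as demo:
    temperature = gr.Slider(0.01, 1.0, value=0.6, label="Temperature")
    top_p = gr.Slider(0.1, 1.0, value=0.95, label="Top-p")
    repetition_penalty = gr.Slider(1.0, 2.0, value=1.1, label="Repetition Penalty")
    max_length = gr.Slider(10, 32768, value=2048, step=1, label="Max length")
    enable_thinking = gr.Checkbox(value=True, label="Enable thinking mode")
    enable_tools = gr.Checkbox(value=False, label="Enable tool calling")
    tool_format = gr.Radio(["XML", "Python"], value="XML", label="Tool calling format")
    tools_editor = gr.Code(language="json", label="Tool definitions")
```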

### Default Tool Set

The app includes two default tools for demonstration (a plausible schema for the second is sketched below):

1. `get_weather`: get weather information for a city
2. `calculate`: perform mathematical calculations
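
The `get_weather` schema appears earlier; a plausible definition for `calculate` (the exact schema shipped with the app is an assumption) could be:

```python
calculate_tool = {
    "name": "calculate",
    "description": "Perform mathematical calculations",
    "parameters": {
        "type": "object",
        "properties": {
            "expression": {
                "type": "string",
                "description": "The mathematical expression to evaluate",
            }
        },
    },
}
```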

## 🚀 Usage Examples

### Basic Chat with Thinking

```python
system_prompt = "Tu es TonicIA, un assistant francophone rigoureux et bienveillant./think"
user_message = "Explique-moi la gravité en termes simples."
```

### Chat with Tool Calling

```python
tools = [
    {
        "name": "get_weather",
        "description": "Get the weather in a city",
        "parameters": {
            "type": "object",
            "properties": {
                "city": {"type": "string", "description": "The city name"}
            }
        }
    }
]

user_message = "Quel temps fait-il à Paris aujourd'hui?"
```

### Agentic Usage

```python
# The model can call tools automatically based on user requests
# Example: "Calculate 15 * 23" triggers the calculate tool
# Example: "What's the weather in London?" triggers the get_weather tool
```
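
A hedged sketch of that flow: generate, detect a tool call, execute it, and feed the result back for a grounded final answer. `generate_response` is an assumed helper (not the app's actual API), and `run_tool_call` refers to the dispatch sketch above:

```python
import re

def agent_turn(system_prompt: str, user_message: str, tools: list) -> str:
    response = generate_response(system_prompt, user_message, tools=tools)  # assumed helper
    match = re.search(r"<tool_call>(.*?)</tool_call>", response, re.DOTALL)
    if match:
        tool_result = run_tool_call(match.group(1))  # see the dispatch sketch above
        # Ask the model to answer again, now grounded in the tool result
        response = generate_response(
            system_prompt, f"{user_message}\n\nTool result: {tool_result}"
        )
    return response
```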

## 📋 Requirements

- **Transformers**: v4.53.0+ (required for SmolLM3 support)
- **PyTorch**: a recent release
- **Gradio**: for the web interface
- **Hugging Face Spaces**: for deployment

## 🔄 Migration from Previous Version

The updated app includes:

1. SmolLM3-compatible generation parameters
2. Thinking mode with proper flag handling
3. Tool calling support (XML and Python)
4. Extended context support
5. Improved response cleaning

## 🎯 Best Practices

1. **Use the recommended parameters**: `temperature=0.6`, `top_p=0.95`, `repetition_penalty=1.1`
2. **Enable thinking** for complex reasoning tasks
3. **Use tool calling** for structured tasks
4. **Keep context within limits**: 65k tokens base, 256k with YaRN
5. **Test tool definitions** before deployment
6. **Adjust the repetition penalty**: 1.0-1.2 for creative tasks, 1.1-1.3 for factual responses
