
# SmolLM3 Features Implementation

This document describes the SmolLM3 features implemented in the Petite Elle L'Aime 3 chat interface.

## 🎯 SmolLM3 Features

### 1. Thinking Mode

SmolLM3 supports extended thinking mode with reasoning traces. The implementation includes:

- **Automatic thinking flags**: system prompts automatically receive a `/think` or `/no_think` flag
- **Manual control**: users can add the flags to system prompts themselves
- **UI toggle**: a checkbox enables or disables thinking mode
- **Response cleaning**: thinking tags are stripped from responses (see the sketch after the usage examples below)

**Usage examples:**

```python
# With thinking enabled (default)
system_prompt = "Tu es TonicIA, un assistant francophone rigoureux et bienveillant./think"

# With thinking disabled
system_prompt = "Tu es TonicIA, un assistant francophone rigoureux et bienveillant./no_think"

# Manual control in the UI
enable_thinking = True  # or False
```
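
The response cleaning mentioned above can be a simple regex strip. A minimal sketch, assuming SmolLM3's `<think>...</think>` trace format (the helper name is illustrative, not the app's actual function):

```python
import re

def clean_thinking(text: str) -> str:
    # Remove <think>...</think> reasoning traces from the raw model output
    return re.sub(r"<think>.*?</think>", "", text, flags=re.DOTALL).strip()
```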

### 2. Tool Calling

SmolLM3 supports both XML and Python tool calling formats:

#### XML Tools (Default)

```json
[
  {
    "name": "get_weather",
    "description": "Get the weather in a city",
    "parameters": {
      "type": "object",
      "properties": {
        "city": {
          "type": "string",
          "description": "The city to get the weather for"
        }
      }
    }
  }
]
```

#### Python Tools

```python
# Tools are called as Python functions inside <code> tags
# Example: <code>get_weather(city="Paris")</code>
```
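
Rather than writing tool schemas by hand, one option is to derive them from annotated Python functions with transformers' `get_json_schema` helper. A sketch (the placeholder function body is an assumption for demonstration):

```python
from transformers.utils import get_json_schema

def get_weather(city: str) -> str:
    """
    Get the weather in a city.

    Args:
        city: The city to get the weather for
    """
    return f"Sunny in {city}"  # placeholder body for demonstration

# Derives a JSON schema from the signature and docstring; the result can be
# passed to the chat template as python_tools (or xml_tools)
weather_tool = get_json_schema(get_weather)
```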

### 3. Generation Parameters

Following SmolLM3 recommendations:

- **Temperature**: 0.6 (recommended default)
- **Top-p**: 0.95 (recommended default)
- **Repetition penalty**: 1.1 (recommended default)
- **Max tokens**: 2048 (configurable up to 32,768)
- **Context length**: up to 65,536 tokens (extensible to 128k or 256k with YaRN)
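
A minimal generation call with these defaults might look like the following (the checkpoint name is assumed; adjust to the model actually deployed):

```python
from transformers import AutoModelForCausalLM, AutoTokenizer

model_id = "HuggingFaceTB/SmolLM3-3B"  # assumed checkpoint
tokenizer = AutoTokenizer.from_pretrained(model_id)
model = AutoModelForCausalLM.from_pretrained(model_id)

inputs = tokenizer("Bonjour !", return_tensors="pt")
outputs = model.generate(
    **inputs,
    do_sample=True,
    temperature=0.6,         # recommended default
    top_p=0.95,              # recommended default
    repetition_penalty=1.1,  # recommended default
    max_new_tokens=2048,
)
print(tokenizer.decode(outputs[0], skip_special_tokens=True))
```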

### 4. Long Context Processing

The model supports:

- **Base context**: 65,536 tokens
- **Extended context**: up to 256k tokens with YaRN scaling (the factor-2.0 config below doubles the base window to 131,072 tokens, i.e. 128k; a factor of 4.0 is needed to reach 256k)
- **YaRN configuration**: available for longer inputs:
```json
{
  "rope_scaling": {
    "factor": 2.0,
    "original_max_position_embeddings": 65536,
    "type": "yarn"
  }
}
```
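
One way to apply this configuration is to override `rope_scaling` when loading the model; a sketch (checkpoint name assumed as above):

```python
from transformers import AutoModelForCausalLM

# Overriding rope_scaling at load time enables YaRN context extension
model = AutoModelForCausalLM.from_pretrained(
    "HuggingFaceTB/SmolLM3-3B",  # assumed checkpoint
    rope_scaling={
        "factor": 2.0,
        "original_max_position_embeddings": 65536,
        "type": "yarn",
    },
)
```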

## 🔧 Implementation Details

### Chat Template Integration

The app applies SmolLM3's chat template with proper handling of thinking flags and tool calling:

```python
def create_prompt(system_message, user_message, enable_thinking=True, tools=None, use_xml_tools=True):
    formatted_messages = []

    # Handle thinking flags
    if system_message and system_message.strip():
        has_think_flag = "/think" in system_message
        has_no_think_flag = "/no_think" in system_message

        if not enable_thinking and not has_no_think_flag:
            system_message += "/no_think"
        elif enable_thinking and not has_think_flag and not has_no_think_flag:
            system_message += "/think"
        formatted_messages.append({"role": "system", "content": system_message})

    formatted_messages.append({"role": "user", "content": user_message})

    # Apply chat template with SmolLM3 features
    template_kwargs = {
        "tokenize": False,
        "add_generation_prompt": True,
        "enable_thinking": enable_thinking
    }

    # Add tool calling
    if tools and len(tools) > 0:
        if use_xml_tools:
            template_kwargs["xml_tools"] = tools
        else:
            template_kwargs["python_tools"] = tools

    return tokenizer.apply_chat_template(formatted_messages, **template_kwargs)
```
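
Calling it is straightforward; for example:

```python
prompt = create_prompt(
    system_message="Tu es TonicIA, un assistant francophone rigoureux et bienveillant.",
    user_message="Explique-moi la gravité en termes simples.",
    enable_thinking=True,
)
```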

### Tool Call Detection

The app detects and formats tool calls in responses:

```python
import re

# Handle tool calls if present (assistant_response and parsed_tools are
# defined earlier in the app's generation pipeline)
if parsed_tools and ("<tool_call>" in assistant_response or "<code>" in assistant_response):
    if "<tool_call>" in assistant_response:
        tool_call_match = re.search(r'<tool_call>(.*?)</tool_call>', assistant_response, re.DOTALL)
        if tool_call_match:
            tool_call = tool_call_match.group(1)
            assistant_response += f"\n\n🔧 Tool Call Detected: {tool_call}\n\nNote: This is a simulated tool call."
    elif "<code>" in assistant_response:
        code_match = re.search(r'<code>(.*?)</code>', assistant_response, re.DOTALL)
        if code_match:
            code_call = code_match.group(1)
            assistant_response += f"\n\n🐍 Python Tool Call: {code_call}\n\nNote: This is a simulated Python tool call."
```
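
To go beyond annotating the response, the detected payload could be dispatched to a local implementation. A sketch, assuming the `<tool_call>` payload is JSON with `name` and `arguments` keys (the exact payload format is not shown above, so treat this as an assumption):

```python
import json

# Hypothetical registry mapping tool names to local implementations
TOOL_REGISTRY = {
    "get_weather": lambda city: f"Sunny in {city}",
}

def run_tool_call(tool_call: str) -> str:
    # Parse the captured payload and dispatch to the matching function
    payload = json.loads(tool_call)
    func = TOOL_REGISTRY[payload["name"]]
    return func(**payload.get("arguments", {}))
```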

## 🎮 UI Features

### Advanced Settings Panel

- **Temperature slider**: 0.01 to 1.0 (default: 0.6)
- **Top-p slider**: 0.1 to 1.0 (default: 0.95)
- **Repetition penalty slider**: 1.0 to 2.0 (default: 1.1)
- **Max length slider**: 10 to 32,768 tokens (default: 2048)
- **Thinking mode checkbox**: enable or disable reasoning traces
- **Tool calling checkbox**: enable or disable function calling
- **XML vs. Python tools**: choose the tool calling format
- **Tool definition editor**: JSON editor for custom tools
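
A minimal sketch of such a panel in Gradio (widget names and layout are illustrative, not the app's exact code):

```python
import gradio as gr

with gr.Blocks() as demo:
    temperature = gr.Slider(0.01, 1.0, value=0.6, label="Temperature")
    top_p = gr.Slider(0.1, 1.0, value=0.95, label="Top-p")
    repetition_penalty = gr.Slider(1.0, 2.0, value=1.1, label="Repetition Penalty")
    max_length = gr.Slider(10, 32768, value=2048, step=1, label="Max length")
    enable_thinking = gr.Checkbox(value=True, label="Enable thinking mode")
    enable_tools = gr.Checkbox(value=False, label="Enable tool calling")
    tool_format = gr.Radio(["XML", "Python"], value="XML", label="Tool calling format")
    tools_editor = gr.Code(language="json", label="Tool definitions")
```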

### Default Tool Set

The app includes two default tools for demonstration (a plausible schema for the second is sketched below):

1. `get_weather`: get weather information for a city
2. `calculate`: perform mathematical calculations
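
The `get_weather` schema appears earlier; a plausible definition for `calculate` (the exact schema shipped with the app is an assumption) could be:

```python
calculate_tool = {
    "name": "calculate",
    "description": "Perform mathematical calculations",
    "parameters": {
        "type": "object",
        "properties": {
            "expression": {
                "type": "string",
                "description": "The mathematical expression to evaluate",
            }
        },
    },
}
```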

## 🚀 Usage Examples

### Basic Chat with Thinking

```python
system_prompt = "Tu es TonicIA, un assistant francophone rigoureux et bienveillant./think"
user_message = "Explique-moi la gravité en termes simples."
```

### Chat with Tool Calling

```python
tools = [
    {
        "name": "get_weather",
        "description": "Get the weather in a city",
        "parameters": {
            "type": "object",
            "properties": {
                "city": {"type": "string", "description": "The city name"}
            }
        }
    }
]

user_message = "Quel temps fait-il à Paris aujourd'hui?"
```

### Agentic Usage

```python
# The model can call tools automatically based on user requests
# Example: "Calculate 15 * 23" triggers the calculate tool
# Example: "What's the weather in London?" triggers the get_weather tool
```
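
A hedged sketch of that flow: generate, detect a tool call, execute it, and feed the result back for a grounded final answer. `generate_response` is an assumed helper (not the app's actual API), and `run_tool_call` refers to the dispatch sketch above:

```python
import re

def agent_turn(system_prompt: str, user_message: str, tools: list) -> str:
    response = generate_response(system_prompt, user_message, tools=tools)  # assumed helper
    match = re.search(r"<tool_call>(.*?)</tool_call>", response, re.DOTALL)
    if match:
        tool_result = run_tool_call(match.group(1))  # see the dispatch sketch above
        # Ask the model to answer again, now grounded in the tool result
        response = generate_response(
            system_prompt, f"{user_message}\n\nTool result: {tool_result}"
        )
    return response
```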

## 📋 Requirements

- **Transformers**: v4.53.0+ (required for SmolLM3 support)
- **PyTorch**: a recent release
- **Gradio**: for the web interface
- **Hugging Face Spaces**: for deployment

## 🔄 Migration from Previous Version

The updated app includes:

1. SmolLM3-compatible generation parameters
2. Thinking mode with proper flag handling
3. Tool calling support (XML and Python)
4. Extended context support
5. Improved response cleaning

## 🎯 Best Practices

1. **Use the recommended parameters**: `temperature=0.6`, `top_p=0.95`, `repetition_penalty=1.1`
2. **Enable thinking** for complex reasoning tasks
3. **Use tool calling** for structured tasks
4. **Keep context within limits**: 65k tokens base, 256k with YaRN
5. **Test tool definitions** before deployment
6. **Adjust the repetition penalty**: 1.0-1.2 for creative tasks, 1.1-1.3 for factual responses
