# Multimodal Settings & File Rendering - Implementation Plan

## Executive Summary

This document provides a comprehensive analysis of the current settings implementation, multimodal input handling, and file rendering in `src/app.py`, along with a detailed implementation plan to improve the user experience.

## 1. Current Settings Analysis

### 1.1 Settings Structure in `src/app.py`

**Current Implementation (Lines 741-887):**

1. **Sidebar Structure:**
   - Authentication section (lines 745-750)
   - About section (lines 752-764)
   - Settings section (lines 767-850):
     - Research Configuration accordion (lines 771-796):
       - `mode_radio`: Orchestrator mode selector
       - `graph_mode_radio`: Graph research mode selector
       - `use_graph_checkbox`: Graph execution toggle
     - Audio Output accordion (lines 798-850):
       - `enable_audio_output_checkbox`: TTS enable/disable
       - `tts_voice_dropdown`: Voice selection
       - `tts_speed_slider`: Speech speed control
       - `tts_gpu_dropdown`: GPU type (non-interactive, visible only if Modal is available)
2. **Hidden Components (Lines 852-865):**
   - `hf_model_dropdown`: Hidden Textbox for model selection
   - `hf_provider_dropdown`: Hidden Textbox for provider selection
3. **Main Area Components (Lines 867-887):**
   - `audio_output`: Audio output component (visible based on `settings.enable_audio_output`)
   - Visibility update function for TTS components

### 1.2 Settings Flow

**Settings → Function Parameters:**

- Settings from the sidebar accordions are passed via `additional_inputs` to the `research_agent()` function
- The hidden textboxes are also passed, but submit empty strings (converted to `None`)
- The OAuth token/profile are passed automatically by Gradio

**Function Signature (Lines 535-546):**

```python
async def research_agent(
    message: str | MultimodalPostprocess,
    history: list[dict[str, Any]],
    mode: str = "simple",
    hf_model: str | None = None,
    hf_provider: str | None = None,
    graph_mode: str = "auto",
    use_graph: bool = True,
    tts_voice: str = "af_heart",
    tts_speed: float = 1.0,
    oauth_token: gr.OAuthToken | None = None,
    oauth_profile: gr.OAuthProfile | None = None,
)
```

### 1.3 Issues Identified

1. **Settings organization:**
   - The audio output component lives in the main area, not the sidebar
   - The hidden components (`hf_model`, `hf_provider`) should be made visible or removed
   - There is no enable/disable setting for image input (only audio input has one)
2. **Visibility:**
   - Audio output visibility is controlled by a checkbox, but the component's placement is suboptimal
   - TTS settings visibility is driven by the checkbox's change event
3. **Configuration gaps:**
   - No `enable_image_input` setting exists in the config (only `enable_audio_input` does)
   - Image processing always runs when files are present (the comment at line 626 says "not just when enable_image_input is True", but that setting does not exist)
## 2. Multimodal Input Analysis

### 2.1 Current Implementation

**ChatInterface Configuration (Lines 892-958):**

- `multimodal=True`: Enables the MultimodalTextbox component
- MultimodalTextbox automatically provides:
  - Text input
  - Image upload button
  - Audio recording button
  - File upload support

**Input Processing (Lines 613-642):**

- The message can be a `str` or a `MultimodalPostprocess` (dict format)
- MultimodalPostprocess format: `{"text": str, "files": list[FileData], "audio": tuple | None}`
- Processing happens in the `research_agent()` function:
  - Extracts text, files, and audio from the message
  - Calls `multimodal_service.process_multimodal_input()`
  - Condition: `if files or (audio_input_data is not None and settings.enable_audio_input)`

**Multimodal Service (`src/services/multimodal_processing.py`):**

- Processes audio input if `settings.enable_audio_input` is True
- Processes image files with no enable/disable check (always processes when files are present)
- Extracts text from images using the OCR service
- Transcribes audio using the STT service

### 2.2 Gradio Documentation Findings

**MultimodalTextbox (ChatInterface with `multimodal=True`):**

- Automatically provides image and audio input capabilities
- Inputs are always visible when the ChatInterface is rendered
- No explicit visibility control is needed; the inputs are part of the textbox component
- Files are delivered via the `files` array in MultimodalPostprocess
- Audio recordings are delivered via the `audio` tuple in MultimodalPostprocess

**Reference Implementation Pattern:**

```python
gr.ChatInterface(
    fn=chat_function,
    multimodal=True,  # Enables image/audio inputs
    # ... other parameters
)
```

### 2.3 Issues Identified

1. **Visibility:**
   - Multimodal inputs ARE always visible (they are part of the MultimodalTextbox)
   - No explicit control is needed; this is working correctly
   - However, users may not realize that image/audio inputs are available
2. **Configuration:**
   - No `enable_image_input` setting exists to disable image processing
   - Image processing always runs when files are present
   - Audio processing respects `settings.enable_audio_input`
3. **User experience:**
   - There is no visual indication that multimodal inputs are available
   - The description mentions "🎤 Multimodal Support" but could be more prominent

## 3. File Rendering Analysis

### 3.1 Current Implementation

**File Detection (Lines 168-195):**

- `_is_file_path()`: Checks whether text looks like a file path
- Checks for file extensions and path separators

**File Rendering in Events (Lines 242-298):**

- For "complete" events, checks `event.data` for "files" or "file" keys
- Validates that files exist using `os.path.exists()`
- Formats files as markdown download links: `📎 [Download: filename](filepath)`
- Stores files in metadata for potential future use

**File Links Format:**

```python
file_links = "\n\n".join([
    f"📎 [Download: {_get_file_name(f)}]({f})"
    for f in valid_files
])
result["content"] = f"{content}\n\n{file_links}"
```

### 3.2 Issues Identified

1. **Rendering method:**
   - Uses markdown links embedded in the content string
   - May not work reliably across Gradio versions
   - A better approach would be Gradio's native `File` component
2. **File validation:**
   - Only checks that the file exists
   - Does not validate file type or size
   - No error handling for inaccessible files
3. **User experience:**
   - Files appear as text links rather than proper file components
   - No preview for images/PDFs
   - No file size information

## 4. Implementation Plan

### Activity 1: Settings Reorganization

**Goal:** Move all settings to the sidebar with better organization

**File:** `src/app.py`

**Tasks:**

1. **Move Audio Output Component to Sidebar (Lines 867-887)**
   - Move the `audio_output` component into the sidebar
   - Place it in the Audio Output accordion or create a separate section
   - Update the visibility logic to work within the sidebar
2. **Add Image Input Settings (New)**
   - Add an `enable_image_input` checkbox to the sidebar
   - Create an "Image Input" accordion, or add the checkbox to a combined "Multimodal Input" accordion
   - Update the config to include an `enable_image_input` setting
3. **Organize Settings Accordions**
   - Research Configuration (existing)
   - Multimodal Input (new; combines image and audio input settings)
   - Audio Output (existing; move the component here)
   - Model Configuration (new; for `hf_model` and `hf_provider`, if they are made visible)

**Subtasks:**

- [ ] Lines 867-871: Move the `audio_output` component definition into the sidebar
- [ ] Lines 873-887: Update the visibility update function to work with the sidebar placement
- [ ] Lines 798-850: Reorganize the Audio Output accordion to include the `audio_output` component
- [ ] Lines 767-796: Keep Research Configuration as-is
- [ ] After line 796: Add a new "Multimodal Input" accordion with `enable_image_input` and `enable_audio_input` checkboxes
- [ ] Lines 852-865: Consider making `hf_model` and `hf_provider` visible, or remove them

### Activity 2: Multimodal Input Visibility

**Goal:** Ensure multimodal inputs are always visible and well-documented

**File:** `src/app.py`

**Tasks:**

1. **Verify Multimodal Inputs Are Visible**
   - Confirm `multimodal=True` in ChatInterface (already done; line 894)
   - Add visual indicators in the description
   - Add tooltips or help text
2. **Add Image Input Configuration**
   - Add `enable_image_input` to the config (`src/utils/config.py`)
   - Update multimodal processing to respect this setting
   - Add a UI control in the sidebar

**Subtasks:**

- [ ] Line 894: Verify `multimodal=True` is set (already correct)
- [ ] Line 908: Enhance the description to highlight multimodal capabilities
- [ ] `src/utils/config.py`: Add `enable_image_input: bool = Field(default=True, ...)`
- [ ] `src/services/multimodal_processing.py`: Add a check for `settings.enable_image_input` before processing images
- [ ] `src/app.py`: Add an `enable_image_input` checkbox to the sidebar

### Activity 3: File Rendering Improvements

**Goal:** Improve file rendering using proper Gradio components

**File:** `src/app.py`

**Tasks:**

1. **Improve File Rendering Method**
   - Use the Gradio `File` component or proper file handling
   - Add file previews for images
   - Show file size and type information
2. **Enhance File Validation**
   - Validate file types
   - Check file accessibility
   - Handle errors gracefully

**Subtasks:**

- [ ] Lines 280-296: Replace the markdown-link approach with proper file component rendering
- [ ] Lines 168-195: Enhance `_is_file_path()` to validate file types
- [ ] Lines 242-298: Update `event_to_chat_message()` to use Gradio `File` components
- [ ] Add file preview functionality for images
- [ ] Add error handling for inaccessible files

### Activity 4: Configuration Updates

**Goal:** Add missing configuration settings

**File:** `src/utils/config.py`

**Tasks:**
1. **Add Image Input Setting**
   - Add an `enable_image_input` field
   - Add an `ocr_api_url` field if missing
   - Add property methods for availability checks

**Subtasks:**

- [ ] After line 147: Add `enable_image_input: bool = Field(default=True, description="Enable image input (OCR) in multimodal interface")`
- [ ] Check whether `ocr_api_url` exists (it should be in the config)
- [ ] Add an `image_ocr_available` property if missing

### Activity 5: Multimodal Service Updates

**Goal:** Respect the image input enable/disable setting

**File:** `src/services/multimodal_processing.py`

**Tasks:**

1. **Add Image Input Check**
   - Check `settings.enable_image_input` before processing images
   - Log when image processing is skipped because of the setting

**Subtasks:**

- [ ] Lines 66-77: Add a check for `settings.enable_image_input` before processing image files
- [ ] Add logging when image processing is skipped

## 5. Detailed File-Level Tasks

### File: `src/app.py`

**Line-Level Subtasks:**

1. **Lines 741-850: Sidebar Reorganization**
   - [ ] 741-765: Keep the authentication and about sections
   - [ ] 767-796: Keep the Research Configuration accordion
   - [ ] 797: Add a new "Multimodal Input" accordion after Research Configuration
   - [ ] 798-850: Reorganize the Audio Output accordion; move the `audio_output` component here
   - [ ] 852-865: Review the hidden components; make them visible or remove them
2. **Lines 867-887: Audio Output Component**
   - [ ] 867-871: Move the `audio_output` definition into the sidebar (Audio Output accordion)
   - [ ] 873-887: Update the visibility function to work with the sidebar placement
3. **Lines 892-958: ChatInterface Configuration**
   - [ ] 894: Verify `multimodal=True` (already correct)
   - [ ] 908: Enhance the description with multimodal capabilities
   - [ ] 946-956: Review `additional_inputs`; ensure all settings are included
4. **Lines 242-298: File Rendering**
   - [ ] 280-296: Replace markdown links with proper file component rendering
   - [ ] Add file previews for images
   - [ ] Add file size/type information
5. **Lines 613-642: Multimodal Input Processing**
   - [ ] 626: Update the condition to check `settings.enable_image_input` for files
   - [ ] Add logging for when image processing is skipped

### File: `src/utils/config.py`

**Line-Level Subtasks:**

1. **Lines 143-180: Audio/Image Configuration**
   - [ ] 144-147: `enable_audio_input` exists (keep as-is)
   - [ ] After 147: Add `enable_image_input: bool = Field(default=True, description="Enable image input (OCR) in multimodal interface")`
   - [ ] Check whether `ocr_api_url` exists (add it if missing)
   - [ ] Add an `image_ocr_available` property method

### File: `src/services/multimodal_processing.py`

**Line-Level Subtasks:**

1. **Lines 65-77: Image Processing**
   - [ ] 66: Add the check: `if files and settings.enable_image_input:`
   - [ ] 71-77: Keep the image processing logic inside the new condition
   - [ ] Add logging when image processing is skipped

## 6. Testing Checklist

- [ ] Verify that all settings are in the sidebar
- [ ] Test multimodal inputs (image upload, audio recording)
- [ ] Test file rendering (markdown, PDF, images)
- [ ] Test the enable/disable toggles for image and audio inputs
- [ ] Test audio output generation and display
- [ ] Test file download links
- [ ] Verify that settings persist across chat sessions
- [ ] Test on different screen sizes (responsive design)

## 7. Implementation Order

1. **Phase 1: Configuration** (Foundation)
   - Add `enable_image_input` to the config
   - Update the multimodal service to respect the setting
2. **Phase 2: Settings Reorganization** (UI)
   - Move audio output to the sidebar
   - Add image input settings to the sidebar
   - Organize the accordions
3. **Phase 3: File Rendering** (Enhancement)
   - Improve the file rendering method
   - Add file previews
   - Enhance validation
4. **Phase 4: Testing & Refinement** (Quality)
   - Test all functionality
   - Fix any issues
   - Refine the UI/UX
## 8. Success Criteria

- ✅ All settings are in the sidebar
- ✅ Multimodal inputs are always visible and functional
- ✅ Files are rendered properly, with previews
- ✅ Image and audio input can be enabled/disabled
- ✅ Settings are well-organized and intuitive
- ✅ No regressions in existing functionality