Linux File Search Query Formatter Model
Model Overview
This model is a Query Formatter trained on the Linux File Search Dataset.
It maps natural language file search queries into a structured JSON-like representation of file attributes based on a fixed schema.
Key Features:
- Converts NL queries → structured tag–value pairs
- Supports all schema attributes from the Linux File Search NLI dataset:
  - File attributes (file_type, extension, size_kb, owner, group, permissions)
  - Temporal attributes (created_year, modified_year)
  - Semantic attributes (language, purpose, contains_text, is_executable, hidden)
  - Path scope and generic tags (path_scope, important, autogenerated, obsolete, archived)
- Outputs deterministic JSON suitable for safe post-processing into find commands or queries for other Linux search tools
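As an illustration of the post-processing step, the sketch below parses a model output and checks it against the schema tags listed above. The tag names come from the feature list; the value types, the function name parse_model_output, and the assumption that the output is strict JSON are all hypothetical and may not match the model's actual output format.

```python
import json

# Tag names are taken from the feature list above.
# The value types are ASSUMPTIONS made for this sketch, not documented behavior.
SCHEMA = {
    "file_type": str, "extension": str, "size_kb": (int, float),
    "owner": str, "group": str, "permissions": str,
    "created_year": int, "modified_year": int,
    "language": str, "purpose": str, "contains_text": str,
    "is_executable": bool, "hidden": bool,
    "path_scope": str, "important": bool, "autogenerated": bool,
    "obsolete": bool, "archived": bool,
}


def parse_model_output(raw: str) -> dict:
    """Parse a (hypothetical) JSON model output and reject anything off-schema."""
    data = json.loads(raw)
    if not isinstance(data, dict):
        raise ValueError("expected a JSON object of tag/value pairs")
    for tag, value in data.items():
        if tag not in SCHEMA:
            raise ValueError(f"unknown tag: {tag!r}")
        expected = SCHEMA[tag]
        # bool is a subclass of int in Python, so reject it explicitly
        # for tags that are not boolean.
        if isinstance(value, bool) and expected is not bool:
            raise ValueError(f"unexpected boolean for tag {tag!r}")
        if not isinstance(value, expected):
            raise ValueError(f"wrong value type for tag {tag!r}")
    return data
```

Rejecting unknown tags outright, rather than silently dropping them, keeps malformed generations from reaching any later command-building step.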
Intended Use
Recommended:
- Formatting natural language queries into structured representations
- Query-to-Structure pipelines for semantic file search
- Integration with safe Linux CLI search tools (find, grep, fd)
- Training downstream Q2I or NLI models
Not Recommended:
- Direct command execution without validation
- General-purpose conversation
- Use outside Linux file systems without adaptation
Model Architecture
- Type: Decoder-only transformer, used for sequence-to-sequence generation (query in, structure out)
- Input: Natural language query
- Output: JSON-like structured representation (tag:value pairs)
- Precision: bf16
- Training Dataset: Linux File Search NLI (~3500 synthetic examples)
- Training Objective: Map NL queries → structured schema attributes
Limitations
- English-only queries
- Linux-centric file system abstraction
- Temporal reasoning limited to years
- Logical operators may require post-processing
- Does not execute commands
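Because temporal attributes are year-granular, a downstream tool has to expand a year into a date range before searching. The sketch below shows one way to do that for modified_year using GNU find's -newermt test. The function name year_to_find_args is hypothetical, and the boundary handling is approximate (a file modified at exactly midnight on January 1 falls into the previous year's range).

```python
def year_to_find_args(year: int) -> list:
    """Expand a modified_year value into GNU find tests covering that year.

    Hypothetical helper: returns argv fragments, not a shell string.
    """
    # bool is a subclass of int, so exclude it explicitly.
    if isinstance(year, bool) or not isinstance(year, int):
        raise ValueError("year must be an integer")
    if not 1970 <= year <= 9998:
        raise ValueError("year out of supported range")
    start = f"{year:04d}-01-01"
    end = f"{year + 1:04d}-01-01"
    # Newer than the start of the year AND NOT newer than the start of the next.
    return ["-newermt", start, "!", "-newermt", end]
```

For example, year_to_find_args(2023) yields the tests -newermt 2023-01-01 ! -newermt 2024-01-01, which can be appended to a find argument list.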
Safety Considerations
- Outputs are structured representations, not shell commands
- Any conversion to executable commands should be validated and sandboxed
- Prevent execution of arbitrary system commands from model output
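A minimal sketch of the validation step described above: it turns a few already-parsed attributes into an argument list for find, never a shell string, and rejects values outside a conservative character whitelist. The function name build_find_argv, the whitelist, the root parameter, and the tag-to-flag mapping are assumptions for illustration; only a subset of the schema is handled.

```python
import re

# Conservative whitelist for values placed into arguments (an ASSUMPTION;
# tighten or loosen to match the values your deployment actually sees).
SAFE_VALUE = re.compile(r"[A-Za-z0-9_.-]+")


def _checked(value) -> str:
    """Return value unchanged if it is a whitelisted string, else raise."""
    if not isinstance(value, str) or not SAFE_VALUE.fullmatch(value):
        raise ValueError(f"unsafe or malformed value: {value!r}")
    return value


def build_find_argv(attrs: dict, root: str = ".") -> list:
    """Build a find argument list from a subset of schema attributes."""
    if root.startswith("-"):
        raise ValueError("search root must not look like an option")
    argv = ["find", root]
    flags = {"owner": "-user", "group": "-group"}
    for tag, value in attrs.items():
        if tag == "extension":
            argv += ["-name", "*." + _checked(value).lstrip(".")]
        elif tag in flags:
            argv += [flags[tag], _checked(value)]
        elif tag == "hidden" and value is True:
            argv += ["-name", ".*"]
        # Other tags are ignored in this sketch.
    return argv
```

The resulting list is meant to be passed to something like subprocess.run(argv) without shell=True, ideally inside a sandbox, so that no part of the model output is ever interpreted by a shell.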