
Meta-Harness: End-to-End Optimization of Model Harnesses

Implementation of the paper "Meta-Harness: End-to-End Optimization of Model Harnesses" by Yoonho Lee, Roshen Nair, Qizheng Zhang, Kangwook Lee, Omar Khattab, and Chelsea Finn.

What is Meta-Harness?

The performance of LLM systems depends not only on model weights, but also on their harness: the code that determines what information to store, retrieve, and present to the model. Meta-Harness automates the design of these harnesses through an outer-loop search over harness code.

Key Insight

Rather than compressing feedback into scalar scores or short summaries (as prior text optimizers do), Meta-Harness gives its proposer full filesystem access to source code, scores, and execution traces of all prior candidates. This enables causal reasoning over failures: not just knowing that a harness failed, but why.

Architecture

┌───────────────────────────────────────────────────┐
│                 Meta-Harness Loop                 │
│                                                   │
│  ┌───────────┐    ┌───────────┐    ┌───────────┐  │
│  │ Proposer  │───>│ Validator │───>│ Evaluator │  │
│  │ (LLM      │    │ (quick    │    │ (full     │  │
│  │  agent)   │    │  check)   │    │ benchmark)│  │
│  └───────────┘    └───────────┘    └───────────┘  │
│        ▲                                 │        │
│        │         ┌──────────┐            │        │
│        └─────────│Filesystem│<───────────┘        │
│                  │    D     │                     │
│                  │ (code,   │                     │
│                  │  scores, │                     │
│                  │  traces) │                     │
│                  └──────────┘                     │
└───────────────────────────────────────────────────┘

Algorithm 1 (from the paper):

  1. Initialize population H with baseline harnesses
  2. Evaluate each baseline → store in filesystem D
  3. For t = 1...N iterations:
    • Proposer inspects filesystem D (code, scores, traces)
    • Proposer proposes k new harnesses
    • Validate each harness (quick interface check)
    • Evaluate valid harnesses on search set
    • Add results to filesystem D
  4. Return Pareto frontier of discovered harnesses
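The loop above can be sketched in a few lines of plain Python. This is an illustrative skeleton, not the package's real API: `propose`, `validate`, and `evaluate` are hypothetical callables standing in for the proposer, validator, and evaluator components, and for brevity it returns a single best harness rather than the full Pareto frontier.

```python
# Minimal sketch of Algorithm 1. `propose`, `validate`, and `evaluate`
# are hypothetical stand-ins for the proposer/validator/evaluator.
def meta_harness_search(baselines, propose, validate, evaluate,
                        n_iters=10, k=2):
    # Filesystem D: every candidate's code and score is kept;
    # nothing is compressed away.
    D = [{"code": h, "score": evaluate(h)} for h in baselines]
    for _ in range(n_iters):
        # The proposer inspects the full filesystem, then emits k candidates.
        for candidate in propose(D, k):
            if not validate(candidate):   # quick interface check
                continue                  # skip malformed harnesses
            D.append({"code": candidate, "score": evaluate(candidate)})
    # The paper returns a Pareto frontier; here we return only the best.
    return max(D, key=lambda r: r["score"])
```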

Installation

pip install -e .

Or install dependencies directly:

pip install openai datasets huggingface_hub

Quick Start

1. Test with Mock Model (no API key needed)

python -m meta_harness.demo --mock --iterations 2

2. Run with OpenAI API

export OPENAI_API_KEY=your_key
python -m meta_harness.demo \
    --provider openai \
    --model gpt-4o-mini \
    --iterations 10 \
    --candidates 2

3. Run with HuggingFace Inference API

export HF_TOKEN=your_token
python -m meta_harness.demo \
    --provider hf_chat \
    --model meta-llama/Llama-3.1-8B-Instruct \
    --iterations 10

4. Python API

from meta_harness.search import MetaHarnessSearch
from meta_harness.proposer import LLMProposer

# Define your model function: it maps a prompt string to the model's reply
def model_fn(prompt: str) -> str:
    response = ...  # your LLM call here
    return response

# Setup
labels = ["disease_a", "disease_b", ...]
train_data = [("symptoms...", "disease_a"), ...]
test_data = [("symptoms...", "disease_b"), ...]

# Create search
search = MetaHarnessSearch(
    model_fn=model_fn,
    labels=labels,
    train_data=train_data,
    test_data=test_data,
    proposer=LLMProposer(model="gpt-4o"),
    iterations=20,
    candidates_per_iteration=2,
)

# Initialize with baselines and run
search.initialize_population()
results = search.run_search()

# Best discovered harness
print(f"Best accuracy: {results['best_harness']['test_scores']['accuracy']:.4f}")
print(f"Best harness code:\n{results['best_harness']['source_code']}")

Project Structure

meta_harness/
β”œβ”€β”€ __init__.py              # Package init
β”œβ”€β”€ __main__.py              # Entry point for python -m meta_harness
β”œβ”€β”€ harness_interface.py     # Base class for harnesses (ExecutionTrace, EvaluationResult)
β”œβ”€β”€ filesystem.py            # Filesystem D - stores code, scores, traces
β”œβ”€β”€ proposer.py              # Agentic LLM proposer with skill template
β”œβ”€β”€ evaluator.py             # Harness evaluation + quick validation
β”œβ”€β”€ baselines.py             # Baseline harnesses (zero-shot, few-shot, TF-IDF, etc.)
β”œβ”€β”€ search.py                # Main Meta-Harness search loop (Algorithm 1)
└── demo.py                  # Demo script with Symptom2Disease dataset
tests/
└── test_core.py             # Unit tests for all components

Components

Harness Interface (harness_interface.py)

Every harness must implement:

  • __init__(model_fn, labels) – receive the LLM callable and valid labels
  • update(x, y) – update internal state with a labeled example
  • predict(x) → str – predict the label for a new input
  • reset() – reset to initial state
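As a concrete illustration, a toy harness satisfying this interface might predict the most frequent label seen so far, ignoring the model entirely. (This is a hypothetical example for the interface shape only; it is not one of the shipped baselines.)

```python
from collections import Counter

class MajorityLabelHarness:
    """Toy harness for the interface above: predicts the most frequent
    label seen so far. Illustrative only, not a shipped baseline."""

    def __init__(self, model_fn, labels):
        self.model_fn = model_fn   # unused here, kept for the interface
        self.labels = labels
        self.counts = Counter()

    def update(self, x, y):
        self.counts[y] += 1        # absorb one labeled example

    def predict(self, x) -> str:
        if not self.counts:
            return self.labels[0]  # no data yet: fall back to first label
        return self.counts.most_common(1)[0][0]

    def reset(self):
        self.counts.clear()        # back to the initial state
```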

Filesystem (filesystem.py)

Manages the growing workspace D that the proposer queries:

  • Hierarchical directory structure with code, scores, and traces
  • Pareto frontier computation (accuracy vs. context tokens)
  • CLI simulation for the proposer (pareto, top N, diff, show, scores)
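The Pareto frontier over (accuracy, context tokens) can be computed by a simple dominance check. A minimal sketch, assuming each record is a `(name, accuracy, tokens)` tuple (a hypothetical shape, not the package's actual data model):

```python
def pareto_frontier(records):
    """Keep harnesses not dominated on (accuracy up, context tokens down).
    Sketch only: `records` is a list of (name, accuracy, tokens) tuples."""
    frontier = []
    for name, acc, tok in records:
        # Dominated if some other record is at least as good on both axes
        # and strictly better on one.
        dominated = any(a >= acc and t <= tok and (a > acc or t < tok)
                        for _, a, t in records)
        if not dominated:
            frontier.append((name, acc, tok))
    return frontier
```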

Proposer (proposer.py)

The agentic LLM that generates new harness candidates:

  • Skill template that defines the proposer's role and constraints
  • Full context: filesystem summary, top harnesses with code, failure examples
  • Code extraction and validation from LLM responses
  • Retry logic for invalid harnesses
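The extract-and-validate step might look like the sketch below: pull the first fenced Python block out of the LLM's reply and confirm it at least parses before anything more expensive runs. This is an assumption-laden sketch, not the package's actual extraction logic.

```python
import re

def extract_harness_code(response: str):
    """Pull the first fenced Python block from an LLM reply and check
    that it parses. Sketch only; not the real proposer.py logic."""
    m = re.search(r"```(?:python)?\n(.*?)```", response, re.DOTALL)
    if m is None:
        return None                 # no code block found: trigger a retry
    code = m.group(1)
    try:
        compile(code, "<candidate>", "exec")   # syntax-level check only
    except SyntaxError:
        return None                 # malformed code: trigger a retry
    return code
```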

Evaluator (evaluator.py)

Runs harness candidates and records results:

  • Safe code loading (dynamic module creation)
  • Online evaluation protocol (update → predict)
  • Execution trace recording for proposer diagnosis
  • Quick validation before expensive evaluation
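An evaluation pass consistent with the train/test split in the Quick Start could look like this: absorb labeled examples via update(), then score predict() on held-out inputs. `evaluate_harness` is a hypothetical helper, not the real evaluator.py.

```python
def evaluate_harness(harness, train_data, test_data):
    """Sketch of an update-then-predict evaluation pass.
    Hypothetical helper; not the package's actual evaluator."""
    harness.reset()
    for x, y in train_data:
        harness.update(x, y)                 # online updates
    correct = sum(harness.predict(x) == y for x, y in test_data)
    return correct / max(len(test_data), 1)  # accuracy on held-out set
```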

Baselines (baselines.py)

Initial population for search:

  1. Zero-Shot: No examples, just label list
  2. Few-Shot: Random sample of recent examples
  3. TF-IDF Retrieval: Retrieve similar examples by TF-IDF similarity
  4. Draft Verification: Two-call procedure with confirmers/challengers (from Appendix B.1)
  5. Label-Primed Query: Coverage + contrastive pairs (highest accuracy variant from paper)
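The retrieval step behind baseline 3 can be sketched with raw token overlap standing in for real TF-IDF weighting (a deliberate simplification; the shipped baseline uses proper TF-IDF similarity):

```python
def retrieve_similar(query: str, store, k: int = 3):
    """Toy retrieval in the spirit of the TF-IDF baseline, scoring by
    raw token overlap instead of TF-IDF weights (a simplification).
    `store` is a list of (text, label) pairs."""
    q = set(query.lower().split())
    scored = sorted(store,
                    key=lambda ex: len(q & set(ex[0].lower().split())),
                    reverse=True)
    return scored[:k]    # the k most similar labeled examples
```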

Configuration

Key parameters from the paper:

Parameter              Paper Value        Default
Iterations             20                 10
Candidates/iteration   2-3                2
Search set size        50-100             50
Population init        4 baselines        5 baselines
Proposer               Claude Opus 4.6    gpt-4o-mini

Paper Results (for reference)

Online Text Classification (3 datasets, GPT-OSS-120B)

Method           Avg Accuracy    Context (K tokens)
Zero-Shot        27.4            0
Few-Shot (all)   40.8            12.3
ACE              40.9            50.8
MCE              40.0            28.5
Meta-Harness     48.6            11.4

vs. Text Optimizers (search set)

Method          Median    Best
OpenEvolve      39.1      43.3
TTT-Discover    34.1      45.6
Meta-Harness    50.0      56.7

Implementation Notes

Practical Tips (from Appendix D)

  1. Write a good skill – the skill text is the strongest lever on search quality
  2. Start with hard examples – build the search set from examples the baseline gets wrong
  3. Log everything – JSON format, hierarchical, consistent naming for grep/regex
  4. Make logs queryable – CLI for Pareto frontier, top-k, diffs
  5. Validate before evaluating – catch malformed harnesses in seconds
  6. 3-5 debug iterations – refine the skill before a full 20-iteration run
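Tip 2 is easy to express in code: keep only the examples a baseline harness misclassifies. (`hard_examples` is a hypothetical helper for illustration.)

```python
def hard_examples(baseline, data):
    """Build a search set from examples the baseline gets wrong
    (tip 2 above). Hypothetical helper; `data` is (x, y) pairs."""
    return [(x, y) for x, y in data if baseline.predict(x) != y]
```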

Key Ablation Finding

Full execution traces are critical. Scores-only reaches 34.6 median accuracy, scores+summary reaches 34.9, but full traces reach 50.0, a gap of more than 15 points.

Citation

@article{lee2026metaharness,
  title={Meta-Harness: End-to-End Optimization of Model Harnesses},
  author={Lee, Yoonho and Nair, Roshen and Zhang, Qizheng and Lee, Kangwook and Khattab, Omar and Finn, Chelsea},
  journal={arXiv preprint arXiv:2603.28052},
  year={2026}
}

License

This is a research implementation. Please cite the original paper if you use this code.
