Meta-Harness: End-to-End Optimization of Model Harnesses
Implementation of the paper "Meta-Harness: End-to-End Optimization of Model Harnesses" by Yoonho Lee, Roshen Nair, Qizheng Zhang, Kangwook Lee, Omar Khattab, and Chelsea Finn.
What is Meta-Harness?
The performance of LLM systems depends not only on model weights, but also on their harness: the code that determines what information to store, retrieve, and present to the model. Meta-Harness automates the design of these harnesses through an outer-loop search over harness code.
Key Insight
Rather than compressing feedback into scalar scores or short summaries (as prior text optimizers do), Meta-Harness gives its proposer full filesystem access to source code, scores, and execution traces of all prior candidates. This enables causal reasoning over failures: not just knowing that a harness failed, but why.
Architecture
+---------------------------------------------------+
|                 Meta-Harness Loop                 |
|                                                   |
|  +----------+     +-----------+     +-----------+ |
|  | Proposer |---->| Validator |---->| Evaluator | |
|  |  (LLM    |     |  (Quick   |     |  (Full    | |
|  |  Agent)  |     |  check)   |     | benchmark)| |
|  +----------+     +-----------+     +-----------+ |
|       ^                                    |      |
|       |         +----------+               |      |
|       +---------|Filesystem|<--------------+      |
|                 |    D     |                      |
|                 |  (code,  |                      |
|                 |  scores, |                      |
|                 |  traces) |                      |
|                 +----------+                      |
+---------------------------------------------------+
Algorithm 1 (from the paper):
- Initialize population H with baseline harnesses
- Evaluate each baseline and store results in filesystem D
- For t = 1...N iterations:
  - Proposer inspects filesystem D (code, scores, traces)
  - Proposer proposes k new harnesses
  - Validate each harness (quick interface check)
  - Evaluate valid harnesses on search set
  - Add results to filesystem D
- Return Pareto frontier of discovered harnesses
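The loop above can be sketched in a few lines of Python. This is an illustrative stand-in, not the package's API: `evaluate` and `propose` are hypothetical callables, harnesses are reduced to bare callables, and the Pareto frontier is collapsed to a single best score.

```python
def meta_harness_search(baselines, evaluate, propose, n_iterations, k):
    """Sketch of Algorithm 1: outer-loop search over harness candidates."""
    D = []  # filesystem D: a record of (harness, score) for every candidate
    for h in baselines:                      # initialize population H with baselines
        D.append({"harness": h, "score": evaluate(h)})
    for t in range(n_iterations):            # t = 1...N
        candidates = propose(D, k)           # proposer inspects D, proposes k harnesses
        for h in candidates:
            if not callable(h):              # stand-in for the quick interface check
                continue
            D.append({"harness": h, "score": evaluate(h)})
    return max(D, key=lambda r: r["score"])  # single-score stand-in for the Pareto frontier
```

Because every candidate's record stays in D, the proposer can condition each round on the full history rather than only the current best.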
Installation
pip install -e .
Or install dependencies directly:
pip install openai datasets huggingface_hub
Quick Start
1. Test with Mock Model (no API key needed)
python -m meta_harness.demo --mock --iterations 2
2. Run with OpenAI API
export OPENAI_API_KEY=your_key
python -m meta_harness.demo \
  --provider openai \
  --model gpt-4o-mini \
  --iterations 10 \
  --candidates 2
3. Run with HuggingFace Inference API
export HF_TOKEN=your_token
python -m meta_harness.demo \
  --provider hf_chat \
  --model meta-llama/Llama-3.1-8B-Instruct \
  --iterations 10
4. Python API
from meta_harness.search import MetaHarnessSearch
from meta_harness.proposer import LLMProposer

# Define your model function
def model_fn(prompt: str) -> str:
    # Your LLM call here
    return response

# Setup
labels = ["disease_a", "disease_b", ...]
train_data = [("symptoms...", "disease_a"), ...]
test_data = [("symptoms...", "disease_b"), ...]

# Create search
search = MetaHarnessSearch(
    model_fn=model_fn,
    labels=labels,
    train_data=train_data,
    test_data=test_data,
    proposer=LLMProposer(model="gpt-4o"),
    iterations=20,
    candidates_per_iteration=2,
)

# Initialize with baselines and run
search.initialize_population()
results = search.run_search()

# Best discovered harness
print(f"Best accuracy: {results['best_harness']['test_scores']['accuracy']:.4f}")
print(f"Best harness code:\n{results['best_harness']['source_code']}")
Project Structure
meta_harness/
├── __init__.py            # Package init
├── __main__.py            # Entry point for python -m meta_harness
├── harness_interface.py   # Base class for harnesses (ExecutionTrace, EvaluationResult)
├── filesystem.py          # Filesystem D - stores code, scores, traces
├── proposer.py            # Agentic LLM proposer with skill template
├── evaluator.py           # Harness evaluation + quick validation
├── baselines.py           # Baseline harnesses (zero-shot, few-shot, TF-IDF, etc.)
├── search.py              # Main Meta-Harness search loop (Algorithm 1)
└── demo.py                # Demo script with Symptom2Disease dataset
tests/
└── test_core.py           # Unit tests for all components
Components
Harness Interface (harness_interface.py)
Every harness must implement:
- __init__(model_fn, labels): receive the LLM callable and valid labels
- update(x, y): update internal state with a labeled example
- predict(x) -> str: predict the label for a new input
- reset(): reset to initial state
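As a concrete illustration, here is a hypothetical toy harness satisfying that four-method interface; it ignores the model entirely and always predicts the most frequent label seen so far (it is not one of the repo's baselines):

```python
class MajorityLabelHarness:
    """Toy harness implementing the interface: __init__, update, predict, reset."""

    def __init__(self, model_fn, labels):
        self.model_fn = model_fn  # LLM callable (unused by this toy harness)
        self.labels = labels      # valid label set
        self.counts = {label: 0 for label in labels}

    def update(self, x, y):
        self.counts[y] += 1       # fold in one labeled example

    def predict(self, x) -> str:
        return max(self.labels, key=lambda label: self.counts[label])

    def reset(self):
        self.counts = {label: 0 for label in self.labels}
```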
Filesystem (filesystem.py)
Manages the growing workspace D that the proposer queries:
- Hierarchical directory structure with code, scores, and traces
- Pareto frontier computation (accuracy vs. context tokens)
- CLI simulation for the proposer (pareto, top N, diff, show, scores)
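The Pareto frontier over (accuracy, context tokens) reduces to a pairwise dominance check: keep a harness iff no other harness is at least as accurate while using no more context. A sketch (the field names `accuracy` and `tokens` are illustrative, not the repo's schema):

```python
def pareto_frontier(records):
    """Return records not dominated on (accuracy: higher better, tokens: lower better)."""
    frontier = []
    for r in records:
        dominated = any(
            # o is at least as good on both axes, strictly better on one
            (o["accuracy"] >= r["accuracy"] and o["tokens"] <= r["tokens"])
            and (o["accuracy"] > r["accuracy"] or o["tokens"] < r["tokens"])
            for o in records
        )
        if not dominated:
            frontier.append(r)
    return frontier
```

On the paper's classification table, for example, Few-Shot and ACE are dominated (Meta-Harness is more accurate with fewer tokens), while Zero-Shot survives on the frontier by using zero context.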
Proposer (proposer.py)
The agentic LLM that generates new harness candidates:
- Skill template that defines the proposer's role and constraints
- Full context: filesystem summary, top harnesses with code, failure examples
- Code extraction and validation from LLM responses
- Retry logic for invalid harnesses
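Extracting a candidate from an LLM reply usually comes down to pulling the first fenced code block out of the response text. A minimal sketch (the actual extraction logic lives in proposer.py and may differ):

```python
import re

def extract_code(response: str):
    """Return the first fenced Python block in an LLM reply, or None if absent."""
    # `{3} matches a triple-backtick fence; the language tag is optional
    match = re.search(r"`{3}(?:python)?\n(.*?)`{3}", response, re.DOTALL)
    return match.group(1).strip() if match else None
```

If extraction or the subsequent interface check fails, the retry logic above re-prompts the proposer instead of discarding the iteration.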
Evaluator (evaluator.py)
Runs harness candidates and records results:
- Safe code loading (dynamic module creation)
- Online evaluation protocol (update → predict)
- Execution trace recording for proposer diagnosis
- Quick validation before expensive evaluation
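A hypothetical sketch of what the quick-validation step can look like: instantiate the candidate, check the interface, and smoke-test one update/predict cycle before paying for a full benchmark run (the real checks live in evaluator.py):

```python
REQUIRED_METHODS = ("update", "predict", "reset")

def quick_validate(harness_cls, model_fn, labels):
    """Cheap interface check before expensive evaluation. Returns (ok, error)."""
    try:
        h = harness_cls(model_fn, labels)         # must accept (model_fn, labels)
        for name in REQUIRED_METHODS:
            if not callable(getattr(h, name, None)):
                return False, f"missing method: {name}"
        h.update("example input", labels[0])      # smoke-test one update/predict cycle
        pred = h.predict("example input")
        if pred not in labels:
            return False, f"predict returned invalid label: {pred!r}"
        return True, ""
    except Exception as e:                        # malformed harnesses fail in seconds
        return False, f"{type(e).__name__}: {e}"
```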
Baselines (baselines.py)
Initial population for search:
- Zero-Shot: No examples, just label list
- Few-Shot: Random sample of recent examples
- TF-IDF Retrieval: Retrieve similar examples by TF-IDF similarity
- Draft Verification: Two-call procedure with confirmers/challengers (from Appendix B.1)
- Label-Primed Query: Coverage + contrastive pairs (highest accuracy variant from paper)
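To illustrate the TF-IDF Retrieval baseline, here is a dependency-free toy retriever; the repo's implementation may differ, and the scoring here is unnormalized TF-IDF overlap rather than cosine similarity:

```python
import math
from collections import Counter

def tfidf_retrieve(query, corpus, k=3):
    """Return indices of the k stored (text, label) examples most similar to query."""
    docs = [Counter(text.lower().split()) for text, _ in corpus]
    n = len(docs)
    # idf: rarer words across the stored examples weigh more
    idf = {w: math.log(n / sum(1 for d in docs if w in d))
           for w in set().union(*docs)}
    q = Counter(query.lower().split())

    def score(d):
        return sum(q[w] * d[w] * idf.get(w, 0.0) ** 2 for w in q)

    return sorted(range(n), key=lambda i: score(docs[i]), reverse=True)[:k]
```

The retrieved examples would then be formatted into the prompt as few-shot demonstrations.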
Configuration
Key parameters from the paper:
| Parameter | Paper Value | Default |
|---|---|---|
| Iterations | 20 | 10 |
| Candidates/iteration | 2-3 | 2 |
| Search set size | 50-100 | 50 |
| Population init | 4 baselines | 5 baselines |
| Proposer | Claude Opus 4.6 | gpt-4o-mini |
Paper Results (for reference)
Online Text Classification (3 datasets, GPT-OSS-120B)
| Method | Avg Accuracy | Context (K tokens) |
|---|---|---|
| Zero-Shot | 27.4 | 0 |
| Few-Shot (all) | 40.8 | 12.3 |
| ACE | 40.9 | 50.8 |
| MCE | 40.0 | 28.5 |
| Meta-Harness | 48.6 | 11.4 |
vs. Text Optimizers (search set)
| Method | Median | Best |
|---|---|---|
| OpenEvolve | 39.1 | 43.3 |
| TTT-Discover | 34.1 | 45.6 |
| Meta-Harness | 50.0 | 56.7 |
Implementation Notes
Practical Tips (from Appendix D)
- Write a good skill: the skill text is the strongest lever on search quality
- Start with hard examples: build the search set from examples the baseline gets wrong
- Log everything: JSON format, hierarchical, consistent naming for grep/regex
- Make logs queryable: CLI for Pareto frontier, top-k, diffs
- Validate before evaluating: catch malformed harnesses in seconds
- 3-5 debug iterations: refine the skill before a full 20-iteration run
Key Ablation Finding
Full execution traces are critical. Scores-only reaches 34.6 median accuracy, scores+summary reaches 34.9, but full traces reach 50.0, a 15+ point gap.
Citation
@article{lee2026metaharness,
title={Meta-Harness: End-to-End Optimization of Model Harnesses},
author={Lee, Yoonho and Nair, Roshen and Zhang, Qizheng and Lee, Kangwook and Khattab, Omar and Finn, Chelsea},
journal={arXiv preprint arXiv:2603.28052},
year={2026}
}
License
This is a research implementation. Please cite the original paper if you use this code.