Meta-Harness: End-to-End Optimization of Model Harnesses
Implementation of the paper "Meta-Harness: End-to-End Optimization of Model Harnesses" by Yoonho Lee, Roshen Nair, Qizheng Zhang, Kangwook Lee, Omar Khattab, and Chelsea Finn.
What is Meta-Harness?
The performance of LLM systems depends not only on model weights, but also on their harness: the code that determines what information to store, retrieve, and present to the model. Meta-Harness automates the design of these harnesses through an outer-loop search over harness code.
Key Insight
Rather than compressing feedback into scalar scores or short summaries (as prior text optimizers do), Meta-Harness gives its proposer full filesystem access to source code, scores, and execution traces of all prior candidates. This enables causal reasoning over failures: not just knowing that a harness failed, but why.
Architecture
+---------------------------------------------------+
|                 Meta-Harness Loop                 |
|                                                   |
|  +----------+     +-----------+     +-----------+ |
|  | Proposer |---->| Validator |---->| Evaluator | |
|  |  (LLM    |     |  (Quick   |     |  (Full    | |
|  |  Agent)  |     |  check)   |     | benchmark)| |
|  +----------+     +-----------+     +-----------+ |
|       ^                                    |      |
|       |         +----------+               |      |
|       +---------|Filesystem|<--------------+      |
|                 |    D     |                      |
|                 |  (code,  |                      |
|                 |  scores, |                      |
|                 |  traces) |                      |
|                 +----------+                      |
+---------------------------------------------------+
Algorithm 1 (from the paper):
- Initialize population H with baseline harnesses
- Evaluate each baseline and store results in filesystem D
- For t = 1...N iterations:
  - Proposer inspects filesystem D (code, scores, traces)
  - Proposer proposes k new harnesses
  - Validate each harness (quick interface check)
  - Evaluate valid harnesses on search set
  - Add results to filesystem D
- Return Pareto frontier of discovered harnesses
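The loop above can be sketched in a few lines of Python. This is an illustrative stand-in, not the package's API: `evaluate` and `propose` are hypothetical callables, harnesses are reduced to bare callables, and the Pareto frontier is collapsed to a single best score.

```python
def meta_harness_search(baselines, evaluate, propose, n_iterations, k):
    """Sketch of Algorithm 1: outer-loop search over harness candidates."""
    D = []  # filesystem D: a record of (harness, score) for every candidate
    for h in baselines:                      # initialize population H with baselines
        D.append({"harness": h, "score": evaluate(h)})
    for t in range(n_iterations):            # t = 1...N
        candidates = propose(D, k)           # proposer inspects D, proposes k harnesses
        for h in candidates:
            if not callable(h):              # stand-in for the quick interface check
                continue
            D.append({"harness": h, "score": evaluate(h)})
    return max(D, key=lambda r: r["score"])  # single-score stand-in for the Pareto frontier
```

Because every candidate's record stays in D, the proposer can condition each round on the full history rather than only the current best.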
Installation
pip install -e .
Or install dependencies directly:
pip install openai datasets huggingface_hub
Quick Start
1. Test with Mock Model (no API key needed)
python -m meta_harness.demo --mock --iterations 2
2. Run with OpenAI API
export OPENAI_API_KEY=your_key
python -m meta_harness.demo \
  --provider openai \
  --model gpt-4o-mini \
  --iterations 10 \
  --candidates 2
3. Run with HuggingFace Inference API
export HF_TOKEN=your_token
python -m meta_harness.demo \
  --provider hf_chat \
  --model meta-llama/Llama-3.1-8B-Instruct \
  --iterations 10
4. Python API
from meta_harness.search import MetaHarnessSearch
from meta_harness.proposer import LLMProposer

# Define your model function
def model_fn(prompt: str) -> str:
    # Your LLM call here
    return response

# Setup
labels = ["disease_a", "disease_b", ...]
train_data = [("symptoms...", "disease_a"), ...]
test_data = [("symptoms...", "disease_b"), ...]

# Create search
search = MetaHarnessSearch(
    model_fn=model_fn,
    labels=labels,
    train_data=train_data,
    test_data=test_data,
    proposer=LLMProposer(model="gpt-4o"),
    iterations=20,
    candidates_per_iteration=2,
)

# Initialize with baselines and run
search.initialize_population()
results = search.run_search()

# Best discovered harness
print(f"Best accuracy: {results['best_harness']['test_scores']['accuracy']:.4f}")
print(f"Best harness code:\n{results['best_harness']['source_code']}")
Project Structure
meta_harness/
├── __init__.py            # Package init
├── __main__.py            # Entry point for python -m meta_harness
├── harness_interface.py   # Base class for harnesses (ExecutionTrace, EvaluationResult)
├── filesystem.py          # Filesystem D - stores code, scores, traces
├── proposer.py            # Agentic LLM proposer with skill template
├── evaluator.py           # Harness evaluation + quick validation
├── baselines.py           # Baseline harnesses (zero-shot, few-shot, TF-IDF, etc.)
├── search.py              # Main Meta-Harness search loop (Algorithm 1)
└── demo.py                # Demo script with Symptom2Disease dataset
tests/
└── test_core.py           # Unit tests for all components
Components
Harness Interface (harness_interface.py)
Every harness must implement:
- __init__(model_fn, labels): receive the LLM callable and valid labels
- update(x, y): update internal state with a labeled example
- predict(x) -> str: predict the label for a new input
- reset(): reset to initial state
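As a concrete illustration, here is a hypothetical toy harness satisfying that four-method interface; it ignores the model entirely and always predicts the most frequent label seen so far (it is not one of the repo's baselines):

```python
class MajorityLabelHarness:
    """Toy harness implementing the interface: __init__, update, predict, reset."""

    def __init__(self, model_fn, labels):
        self.model_fn = model_fn  # LLM callable (unused by this toy harness)
        self.labels = labels      # valid label set
        self.counts = {label: 0 for label in labels}

    def update(self, x, y):
        self.counts[y] += 1       # fold in one labeled example

    def predict(self, x) -> str:
        return max(self.labels, key=lambda label: self.counts[label])

    def reset(self):
        self.counts = {label: 0 for label in self.labels}
```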
Filesystem (filesystem.py)
Manages the growing workspace D that the proposer queries:
- Hierarchical directory structure with code, scores, and traces
- Pareto frontier computation (accuracy vs. context tokens)
- CLI simulation for the proposer (pareto, top N, diff, show, scores)
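The Pareto frontier over (accuracy, context tokens) reduces to a pairwise dominance check: keep a harness iff no other harness is at least as accurate while using no more context. A sketch (the field names `accuracy` and `tokens` are illustrative, not the repo's schema):

```python
def pareto_frontier(records):
    """Return records not dominated on (accuracy: higher better, tokens: lower better)."""
    frontier = []
    for r in records:
        dominated = any(
            # o is at least as good on both axes, strictly better on one
            (o["accuracy"] >= r["accuracy"] and o["tokens"] <= r["tokens"])
            and (o["accuracy"] > r["accuracy"] or o["tokens"] < r["tokens"])
            for o in records
        )
        if not dominated:
            frontier.append(r)
    return frontier
```

On the paper's classification table, for example, Few-Shot and ACE are dominated (Meta-Harness is more accurate with fewer tokens), while Zero-Shot survives on the frontier by using zero context.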
Proposer (proposer.py)
The agentic LLM that generates new harness candidates:
- Skill template that defines the proposer's role and constraints
- Full context: filesystem summary, top harnesses with code, failure examples
- Code extraction and validation from LLM responses
- Retry logic for invalid harnesses
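Extracting a candidate from an LLM reply usually comes down to pulling the first fenced code block out of the response text. A minimal sketch (the actual extraction logic lives in proposer.py and may differ):

```python
import re

def extract_code(response: str):
    """Return the first fenced Python block in an LLM reply, or None if absent."""
    # `{3} matches a triple-backtick fence; the language tag is optional
    match = re.search(r"`{3}(?:python)?\n(.*?)`{3}", response, re.DOTALL)
    return match.group(1).strip() if match else None
```

If extraction or the subsequent interface check fails, the retry logic above re-prompts the proposer instead of discarding the iteration.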
Evaluator (evaluator.py)
Runs harness candidates and records results:
- Safe code loading (dynamic module creation)
- Online evaluation protocol (update → predict)
- Execution trace recording for proposer diagnosis
- Quick validation before expensive evaluation
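A hypothetical sketch of what the quick-validation step can look like: instantiate the candidate, check the interface, and smoke-test one update/predict cycle before paying for a full benchmark run (the real checks live in evaluator.py):

```python
REQUIRED_METHODS = ("update", "predict", "reset")

def quick_validate(harness_cls, model_fn, labels):
    """Cheap interface check before expensive evaluation. Returns (ok, error)."""
    try:
        h = harness_cls(model_fn, labels)         # must accept (model_fn, labels)
        for name in REQUIRED_METHODS:
            if not callable(getattr(h, name, None)):
                return False, f"missing method: {name}"
        h.update("example input", labels[0])      # smoke-test one update/predict cycle
        pred = h.predict("example input")
        if pred not in labels:
            return False, f"predict returned invalid label: {pred!r}"
        return True, ""
    except Exception as e:                        # malformed harnesses fail in seconds
        return False, f"{type(e).__name__}: {e}"
```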
Baselines (baselines.py)
Initial population for search:
- Zero-Shot: No examples, just label list
- Few-Shot: Random sample of recent examples
- TF-IDF Retrieval: Retrieve similar examples by TF-IDF similarity
- Draft Verification: Two-call procedure with confirmers/challengers (from Appendix B.1)
- Label-Primed Query: Coverage + contrastive pairs (highest accuracy variant from paper)
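To illustrate the TF-IDF Retrieval baseline, here is a dependency-free toy retriever; the repo's implementation may differ, and the scoring here is unnormalized TF-IDF overlap rather than cosine similarity:

```python
import math
from collections import Counter

def tfidf_retrieve(query, corpus, k=3):
    """Return indices of the k stored (text, label) examples most similar to query."""
    docs = [Counter(text.lower().split()) for text, _ in corpus]
    n = len(docs)
    # idf: rarer words across the stored examples weigh more
    idf = {w: math.log(n / sum(1 for d in docs if w in d))
           for w in set().union(*docs)}
    q = Counter(query.lower().split())

    def score(d):
        return sum(q[w] * d[w] * idf.get(w, 0.0) ** 2 for w in q)

    return sorted(range(n), key=lambda i: score(docs[i]), reverse=True)[:k]
```

The retrieved examples would then be formatted into the prompt as few-shot demonstrations.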
Configuration
Key parameters from the paper:
| Parameter | Paper Value | Default |
|---|---|---|
| Iterations | 20 | 10 |
| Candidates/iteration | 2-3 | 2 |
| Search set size | 50-100 | 50 |
| Population init | 4 baselines | 5 baselines |
| Proposer | Claude Opus 4.6 | gpt-4o-mini |
Paper Results (for reference)
Online Text Classification (3 datasets, GPT-OSS-120B)
| Method | Avg Accuracy | Context (K tokens) |
|---|---|---|
| Zero-Shot | 27.4 | 0 |
| Few-Shot (all) | 40.8 | 12.3 |
| ACE | 40.9 | 50.8 |
| MCE | 40.0 | 28.5 |
| Meta-Harness | 48.6 | 11.4 |
vs. Text Optimizers (search set)
| Method | Median | Best |
|---|---|---|
| OpenEvolve | 39.1 | 43.3 |
| TTT-Discover | 34.1 | 45.6 |
| Meta-Harness | 50.0 | 56.7 |
Implementation Notes
Practical Tips (from Appendix D)
- Write a good skill: the skill text is the strongest lever on search quality
- Start with hard examples: build the search set from examples the baseline gets wrong
- Log everything: JSON format, hierarchical, consistent naming for grep/regex
- Make logs queryable: CLI for Pareto frontier, top-k, diffs
- Validate before evaluating: catch malformed harnesses in seconds
- 3-5 debug iterations: refine the skill before a full 20-iteration run
Key Ablation Finding
Full execution traces are critical. Scores-only reaches 34.6 median accuracy, scores+summary reaches 34.9, but full traces reach 50.0, a 15+ point gap.
Citation
@article{lee2026metaharness,
title={Meta-Harness: End-to-End Optimization of Model Harnesses},
author={Lee, Yoonho and Nair, Roshen and Zhang, Qizheng and Lee, Kangwook and Khattab, Omar and Finn, Chelsea},
journal={arXiv preprint arXiv:2603.28052},
year={2026}
}
License
This is a research implementation. Please cite the original paper if you use this code.