# automedbench-seg-task-image
Pre-built Docker image for the AutoMedBench segmentation benchmark. Self-contained: Qwen2.5-32B-Instruct-AWQ judge weights, harness code, and 5 hosted task datasets (`kidney`, `liver`, `pancreas`, `aeropath`, `tsg-multiorgan`) are all baked into one image. No data fetches at runtime.
## One-click run

Paste this whole block into a fresh, empty directory:
```bash
# 0. inputs — only these two are mandatory
export HF_TOKEN=<your hf token>           # for the first-time image pull
export API_KEY=<your LLM provider key>    # for the agent under test

# optional defaults; change to switch task or tier:
export AGENT=claude-opus-4-7
export TASK=kidney   # kidney | liver | pancreas | aeropath | tsg-multiorgan | feta | pancreas-oar
export TIER=lite     # lite | standard | pro

# 1. fetch this repo (image + compose + secrets template + this README)
pip install -q -U huggingface_hub
hf download MitakaKuma/automedbench-seg-task-image \
  --repo-type model --local-dir ./automedbench-seg-task-image
cd automedbench-seg-task-image

# 2. verify integrity + load the image (≈ 5 min, one-time)
sha256sum -c image.tar.sha256
docker load < image.tar

# 3. write secrets file (chmod 600, gitignored, only bind-mounted to the bench container)
mkdir -p eval_seg
cat > eval_seg/secrets.yaml <<YAML
api_keys:
  nvidia_inference: $API_KEY
  openrouter: $API_KEY
  hf_token: $HF_TOKEN
YAML
chmod 600 eval_seg/secrets.yaml

# 4. go
mkdir -p runs
docker compose up --abort-on-container-exit --exit-code-from bench
```
Result: `runs/<your-user>/bench-<AGENT>-<TASK>-<TIER>/<run-id>/detail_report.json`.
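Since the exact path contains run-specific IDs, a small helper can walk `runs/` and dump whatever reports exist so far. A minimal sketch; it uses Python's stdlib `json.tool` for pretty-printing, so no extra tools are assumed:

```shell
# show_reports: pretty-print every detail_report.json found under runs/
show_reports() {
  find runs -name detail_report.json 2>/dev/null | while read -r report; do
    echo "== $report"
    python3 -m json.tool "$report"   # stdlib pretty-printer, no jq needed
  done
}
```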
If your `API_KEY` is for a provider other than NVIDIA Inference (e.g. you want to run the `kimik2.5` agent via OpenRouter), open `eval_seg/secrets.yaml` in your editor and put the key on the matching line; the template has slots for `nvidia_inference` and `openrouter`.
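For example, a filled-in `eval_seg/secrets.yaml` for an OpenRouter-backed agent might look like the following (all key values are placeholders; the layout mirrors the template written in step 3):

```yaml
api_keys:
  nvidia_inference: ""                 # unused in this setup
  openrouter: sk-or-v1-xxxxxxxxxxxx    # your OpenRouter key goes here
  hf_token: hf_xxxxxxxxxxxx            # your Hugging Face token
```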
## Tasks (`TASK=...`)

| TASK | Modality | Source | Data location |
|---|---|---|---|
| `kidney` | CT | KiTS19 | hosted in image |
| `liver` | CT | LiTS | hosted in image |
| `pancreas` | CT | PanTS | hosted in image |
| `aeropath` | CT | AeroPath | hosted in image |
| `tsg-multiorgan` | CT | TotalSegmentator | hosted in image |
| `feta` | MRI T2w | FeTA Challenge | bring your own (license-locked) |
| `pancreas-oar` | CT | PanTS multi-organ | bring your own (license-locked) |
For BYO datasets, place files at `data/<DataDirName>/{public,private}/<patient_id>/` next to `docker-compose.yml`. The exact `DataDirName` per task lives in the harness's task config.
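A quick sanity check of that layout before launching can save a failed run. A minimal sketch, assuming the expected tree is exactly `data/<DataDirName>/{public,private}/<patient_id>/`; the `DataDirName` argument is whatever name the harness task config specifies, so `DemoSet` in the usage below is purely illustrative:

```shell
# check_byo_layout: verify data/<DataDirName>/{public,private}/<patient_id>/
# usage: check_byo_layout <DataDirName>
check_byo_layout() {
  local dir="data/$1" split n
  for split in public private; do
    if [ ! -d "$dir/$split" ]; then
      echo "missing $dir/$split"; return 1
    fi
    # each split needs at least one <patient_id> subdirectory
    n=$(find "$dir/$split" -mindepth 1 -maxdepth 1 -type d | wc -l)
    if [ "$n" -eq 0 ]; then
      echo "no patient dirs under $dir/$split"; return 1
    fi
    echo "ok: $dir/$split ($n patient dirs)"
  done
}
```

Run it as `check_byo_layout DemoSet` from the directory containing `data/`; a non-zero exit means something is missing.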
## Tiers (`TIER=...`)

| TIER | What's tested |
|---|---|
| `lite` | Follow a known recipe: exact model named, `requirements.txt` provided, S1–S3 skill hints. |
| `standard` | Pick within bounds: 2–5 candidate model families given, S1 hint only. |
| `pro` | Design from clinical context: no model hints, no `requirements.txt`. |
Running the same `TASK` across all three tiers is the sharpest measure of how much of the research pipeline an agent can do unassisted.
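Such a sweep is just a loop over `TIER`. The sketch below reuses the defaults from the one-click block and, as a safety measure, only echoes the three invocations; set `run=""` to actually launch them:

```shell
# sweep all three tiers of one task; run=echo makes this a dry run
run=echo   # set run="" to actually launch docker compose
export AGENT=claude-opus-4-7 TASK=kidney
for TIER in lite standard pro; do
  export TIER
  $run docker compose up --abort-on-container-exit --exit-code-from bench
done
```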
## Hardware
- 1× NVIDIA GPU with ≥ 80 GB VRAM (H100 or A100 80GB)
- ~100 GB free disk for first-time pull + load
- Docker 24+ with NVIDIA Container Toolkit
- Linux
## Files in this repo

| File | Purpose |
|---|---|
| `image.tar` (≈ 37 GB) | Docker image: harness + Qwen-32B judge weights + 5 datasets, all baked in |
| `image.tar.sha256` | Integrity checksum (`sha256sum -c` from this dir) |
| `docker-compose.yml` | Two-service orchestration (judge + bench); no edits needed |
| `secrets.yaml.example` | Template for `eval_seg/secrets.yaml` (your provider key + HF token) |
| `README.md` | This file |
## Safety notes

- No secrets baked into the image. Your `API_KEY` and `HF_TOKEN` stay on the host in `eval_seg/secrets.yaml` (chmod 600), bind-mounted read-only into the bench container.
- Agent task-specific deps (`nnunetv2`, `torchio`, `totalsegmentator`) are NOT pre-installed. The agent under test installs them itself via `pip install -r requirements.txt`; that step is part of the S2 rubric, so pre-baking would short-circuit the score.
## Reference
- Source repo & docs: https://github.com/AutoMedBench/AutoMedBench
- Scoring rubric & tier definitions: https://github.com/AutoMedBench/AutoMedBench/blob/main/docs/task-difficulty-tiers.md