# automedbench-seg-task-image
Pre-built Docker image for the AutoMedBench segmentation benchmark. Self-contained: Qwen2.5-32B-Instruct-AWQ judge weights, harness code, and 5 hosted task datasets (`kidney`, `liver`, `pancreas`, `aeropath`, `tsg-multiorgan`) are all baked into one image. No data fetches at runtime.
## One-click run

Paste this whole block into a fresh, empty directory:
```bash
# 0. inputs — only these two are mandatory
export HF_TOKEN=<your hf token>           # for the first-time image pull
export API_KEY=<your LLM provider key>    # for the agent under test

# optional defaults; change to switch task or tier:
export AGENT=claude-opus-4-7
export TASK=kidney   # kidney | liver | pancreas | aeropath | tsg-multiorgan | feta | pancreas-oar
export TIER=lite     # lite | standard | pro

# 1. fetch this repo (image + compose + secrets template + this README)
pip install -q -U huggingface_hub
hf download MitakaKuma/automedbench-seg-task-image \
  --repo-type model --local-dir ./automedbench-seg-task-image
cd automedbench-seg-task-image

# 2. verify integrity + load the image (≈ 5 min, one-time)
sha256sum -c image.tar.sha256
docker load < image.tar

# 3. write secrets file (chmod 600, gitignored, only bind-mounted to the bench container)
mkdir -p eval_seg
cat > eval_seg/secrets.yaml <<YAML
api_keys:
  nvidia_inference: $API_KEY
  openrouter: $API_KEY
  hf_token: $HF_TOKEN
YAML
chmod 600 eval_seg/secrets.yaml

# 4. go
mkdir -p runs
docker compose up --abort-on-container-exit --exit-code-from bench
```
Result: `runs/<your-user>/bench-<AGENT>-<TASK>-<TIER>/<run-id>/detail_report.json`.
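Since the exact path contains run-specific IDs, a small helper can walk `runs/` and dump whatever reports exist so far. A minimal sketch; it uses Python's stdlib `json.tool` for pretty-printing, so no extra tools are assumed:

```shell
# show_reports: pretty-print every detail_report.json found under runs/
show_reports() {
  find runs -name detail_report.json 2>/dev/null | while read -r report; do
    echo "== $report"
    python3 -m json.tool "$report"   # stdlib pretty-printer, no jq needed
  done
}
```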
If your `API_KEY` is for a provider other than NVIDIA Inference (e.g. you want to run the `kimik2.5` agent via OpenRouter), open `eval_seg/secrets.yaml` in your editor and put the key on the matching line; the template has slots for `nvidia_inference` and `openrouter`.
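For example, a filled-in `eval_seg/secrets.yaml` for an OpenRouter-backed agent might look like the following (all key values are placeholders; the layout mirrors the template written in step 3):

```yaml
api_keys:
  nvidia_inference: ""                 # unused in this setup
  openrouter: sk-or-v1-xxxxxxxxxxxx    # your OpenRouter key goes here
  hf_token: hf_xxxxxxxxxxxx            # your Hugging Face token
```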
## Tasks (`TASK=...`)

| TASK | Modality | Source | Data location |
|---|---|---|---|
| `kidney` | CT | KiTS19 | hosted in image |
| `liver` | CT | LiTS | hosted in image |
| `pancreas` | CT | PanTS | hosted in image |
| `aeropath` | CT | AeroPath | hosted in image |
| `tsg-multiorgan` | CT | TotalSegmentator | hosted in image |
| `feta` | MRI T2w | FeTA Challenge | bring your own (license-locked) |
| `pancreas-oar` | CT | PanTS multi-organ | bring your own (license-locked) |
For BYO datasets, place files at `data/<DataDirName>/{public,private}/<patient_id>/` next to `docker-compose.yml`. The exact `DataDirName` per task lives in the harness's task config.
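A quick sanity check of that layout before launching can save a failed run. A minimal sketch, assuming the expected tree is exactly `data/<DataDirName>/{public,private}/<patient_id>/`; the `DataDirName` argument is whatever name the harness task config specifies, so `DemoSet` in the usage below is purely illustrative:

```shell
# check_byo_layout: verify data/<DataDirName>/{public,private}/<patient_id>/
# usage: check_byo_layout <DataDirName>
check_byo_layout() {
  local dir="data/$1" split n
  for split in public private; do
    if [ ! -d "$dir/$split" ]; then
      echo "missing $dir/$split"; return 1
    fi
    # each split needs at least one <patient_id> subdirectory
    n=$(find "$dir/$split" -mindepth 1 -maxdepth 1 -type d | wc -l)
    if [ "$n" -eq 0 ]; then
      echo "no patient dirs under $dir/$split"; return 1
    fi
    echo "ok: $dir/$split ($n patient dirs)"
  done
}
```

Run it as `check_byo_layout DemoSet` from the directory containing `data/`; a non-zero exit means something is missing.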
## Tiers (`TIER=...`)

| TIER | What's tested |
|---|---|
| `lite` | Follow a known recipe: exact model named, `requirements.txt` provided, S1–S3 skill hints. |
| `standard` | Pick within bounds: 2–5 candidate model families given, S1 hint only. |
| `pro` | Design from clinical context: no model hints, no `requirements.txt`. |
Running the same `TASK` across all three tiers is the sharpest measure of how much of the research pipeline an agent can do unassisted.
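Such a sweep is just a loop over `TIER`. The sketch below reuses the defaults from the one-click block and, as a safety measure, only echoes the three invocations; set `run=""` to actually launch them:

```shell
# sweep all three tiers of one task; run=echo makes this a dry run
run=echo   # set run="" to actually launch docker compose
export AGENT=claude-opus-4-7 TASK=kidney
for TIER in lite standard pro; do
  export TIER
  $run docker compose up --abort-on-container-exit --exit-code-from bench
done
```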
## Hardware
- 1× NVIDIA GPU with ≥ 80 GB VRAM (H100 or A100 80GB)
- ~100 GB free disk for first-time pull + load
- Docker 24+ with NVIDIA Container Toolkit
- Linux
## Files in this repo

| File | Purpose |
|---|---|
| `image.tar` (≈ 37 GB) | Docker image: harness + Qwen-32B judge weights + 5 datasets, all baked in |
| `image.tar.sha256` | Integrity checksum (`sha256sum -c` from this dir) |
| `docker-compose.yml` | Two-service orchestration (judge + bench); no edits needed |
| `secrets.yaml.example` | Template for `eval_seg/secrets.yaml` (your provider key + HF token) |
| `README.md` | This file |
## Safety notes

- No secrets baked into the image. Your `API_KEY` and `HF_TOKEN` stay on the host in `eval_seg/secrets.yaml` (chmod 600), bind-mounted read-only into the bench container.
- Agent task-specific deps (`nnunetv2`, `torchio`, `totalsegmentator`) are NOT pre-installed. The agent under test installs them itself via `pip install -r requirements.txt`; that step is part of the S2 rubric, so pre-baking would short-circuit the score.
## Reference
- Source repo & docs: https://github.com/AutoMedBench/AutoMedBench
- Scoring rubric & tier definitions: https://github.com/AutoMedBench/AutoMedBench/blob/main/docs/task-difficulty-tiers.md