
automedbench-seg-task-image

Pre-built Docker image for the AutoMedBench segmentation benchmark. Self-contained: the Qwen2.5-32B-Instruct-AWQ judge weights, the harness code, and 5 hosted task datasets (kidney, liver, pancreas, aeropath, tsg-multiorgan) are all baked into one image, so nothing is fetched at runtime.

One-click run

Run this whole block from a fresh, empty directory.

```shell
# 0. inputs — only these two are mandatory
export HF_TOKEN=<your hf token>           # for the first-time image pull
export API_KEY=<your LLM provider key>    # for the agent under test

# optional defaults; change to switch task or tier:
export AGENT=claude-opus-4-7
export TASK=kidney      # kidney | liver | pancreas | aeropath | tsg-multiorgan | feta | pancreas-oar
export TIER=lite        # lite | standard | pro

# 1. fetch this repo (image + compose + secrets template + this README)
pip install -q -U huggingface_hub
hf download MitakaKuma/automedbench-seg-task-image \
    --repo-type model --local-dir ./automedbench-seg-task-image
cd automedbench-seg-task-image

# 2. verify integrity + load the image (≈ 5 min, one-time)
sha256sum -c image.tar.sha256
docker load < image.tar

# 3. write secrets file (chmod 600, gitignored, only bind-mounted to the bench container)
mkdir -p eval_seg
cat > eval_seg/secrets.yaml <<YAML
api_keys:
  nvidia_inference: $API_KEY
  openrouter: $API_KEY
hf_token: $HF_TOKEN
YAML
chmod 600 eval_seg/secrets.yaml

# 4. go
mkdir -p runs
docker compose up --abort-on-container-exit --exit-code-from bench
```

The result lands at `runs/<your-user>/bench-<AGENT>-<TASK>-<TIER>/<run-id>/detail_report.json`.
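Once a run finishes, a quick way to pull up the newest report is to glob the `runs/` layout above (a sketch, assuming that directory pattern):

```shell
# Print the most recently written detail_report.json under runs/.
# Assumes the runs/<user>/bench-<AGENT>-<TASK>-<TIER>/<run-id>/ layout above.
latest=$(ls -t runs/*/bench-*/*/detail_report.json 2>/dev/null | head -n 1)
echo "${latest:-no report found yet}"
```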

If your API_KEY is for a provider other than NVIDIA Inference (e.g. you want to run the kimik2.5 agent via OpenRouter), open `eval_seg/secrets.yaml` in your editor and put the key on the matching line; the template has slots for both `nvidia_inference` and `openrouter`.
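For example, an OpenRouter-only setup might look like this (the key values are placeholders, substitute your own):

```yaml
# eval_seg/secrets.yaml (placeholder values)
api_keys:
  nvidia_inference: ""
  openrouter: <your OpenRouter key>
hf_token: <your HF token>
```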

Tasks (TASK=...)

| TASK | Modality | Source | Data location |
|---|---|---|---|
| `kidney` | CT | KiTS19 | hosted in image |
| `liver` | CT | LiTS | hosted in image |
| `pancreas` | CT | PanTS | hosted in image |
| `aeropath` | CT | AeroPath | hosted in image |
| `tsg-multiorgan` | CT | TotalSegmentator | hosted in image |
| `feta` | MRI T2w | FeTA Challenge | bring your own (license-locked) |
| `pancreas-oar` | CT | PanTS multi-organ | bring your own (license-locked) |

For BYO datasets, place files at `data/<DataDirName>/{public,private}/<patient_id>/` next to `docker-compose.yml`. The exact `DataDirName` for each task lives in the harness's task config.
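As a sketch, staging a hypothetical BYO case could look like the following. `FeTA` as the `DataDirName` and `sub-001` as the patient ID are placeholders here; check the harness's task config for the real names:

```shell
# Placeholder names: confirm DataDirName and patient IDs in the task config.
mkdir -p data/FeTA/public/sub-001 data/FeTA/private/sub-001
# Then copy your licensed files into the public/ and private/ halves accordingly.
```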

Tiers (TIER=...)

| TIER | What's tested |
|---|---|
| `lite` | Follow a known recipe: exact model named, `requirements.txt` provided, S1–S3 skill hints. |
| `standard` | Pick within bounds: 2–5 candidate model families given, S1 hint only. |
| `pro` | Design from clinical context: no model hints, no `requirements.txt`. |

Running the same TASK across all three tiers is the sharpest measure of how much of the research pipeline an agent can do unassisted.
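To compare tiers you can sweep `TIER` in a loop. The sketch below only prints the per-tier run-directory names implied by the result path above, with the actual `docker compose` invocation left as a comment:

```shell
# Sweep all three tiers for one agent/task pair; prints each run-dir name.
AGENT=claude-opus-4-7
TASK=kidney
for TIER in lite standard pro; do
  echo "bench-${AGENT}-${TASK}-${TIER}"
  # export TIER; docker compose up --abort-on-container-exit --exit-code-from bench
done
```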

Hardware

- 1× NVIDIA GPU with ≥ 80 GB VRAM (H100 or A100 80GB)
- ~100 GB of free disk for the first-time pull + load
- Docker 24+ with the NVIDIA Container Toolkit
- Linux

Files in this repo

| File | Purpose |
|---|---|
| `image.tar` (≈ 37 GB) | Docker image: harness + Qwen-32B judge weights + 5 datasets, all baked in |
| `image.tar.sha256` | Integrity checksum (run `sha256sum -c` from this dir) |
| `docker-compose.yml` | Two-service orchestration (judge + bench); no edits needed |
| `secrets.yaml.example` | Template for `eval_seg/secrets.yaml` (your provider key + HF token) |
| `README.md` | This file |

Safety notes

- No secrets are baked into the image. Your API_KEY and HF_TOKEN stay on the host in `eval_seg/secrets.yaml` (chmod 600), bind-mounted read-only into the bench container.
- Agent task-specific deps (nnunetv2, torchio, totalsegmentator) are NOT pre-installed. The agent under test installs them itself via `pip install -r requirements.txt`; that step is part of the S2 rubric, so pre-baking them would short-circuit the score.
