ho22joshua committed on
Commit 242932b · 1 Parent(s): 751d271

new readme following paper, renaming original readme to SETUP.md

Files changed (2):
  1. README.md +87 -396
  2. SETUP.md +448 -0
README.md CHANGED
@@ -1,448 +1,139 @@
1
- # Large Language Model Analysis Framework for High Energy Physics
2
 
3
- A framework for testing and evaluating Large Language Models (LLMs) on ATLAS H→γγ analysis tasks using a supervisor-coder architecture.
4
 
5
- ## Table of Contents
6
- - [Setup](#setup)
7
- - [Data and Solution](#data-and-solution)
8
- - [Running Tests](#running-tests)
9
- - [Analysis and Visualization](#analysis-and-visualization)
10
- - [Project Structure](#project-structure)
11
- - [Advanced Usage](#advanced-usage)
12
 
13
- ---
14
-
15
- ## Setup
16
-
17
- ### Prerequisites
18
 
19
- **CBORG API Access Required**
20
 
21
- This framework uses Lawrence Berkeley National Laboratory's CBORG API to access various LLM models. To use this code, you will need:
22
 
23
- 1. Access to the CBORG API (contact LBL for access)
24
- 2. A CBORG API key
25
- 3. Network access to the CBORG API endpoint
26
 
27
- **Note for External Users:** CBORG is an internal LBL system. External users may need to:
28
- - Request guest access through LBL collaborations
29
- - Adapt the code to use OpenAI API directly (requires code modifications)
30
- - Contact the repository maintainers for alternative deployment options
31
 
32
- ### Environment Setup
33
- Create Conda environment:
34
- ```bash
35
- mamba env create -f environment.yml
36
- conda activate llm_env
37
- ```
38
 
39
- ### API Configuration
40
- Create script `~/.apikeys.sh` to export CBORG API key:
41
- ```bash
42
- export CBORG_API_KEY="INSERT_API_KEY"
43
- ```
44
 
45
- Then source it before running tests:
46
- ```bash
47
- source ~/.apikeys.sh
48
- ```
49
 
50
- ### Initial Configuration
51
 
52
- Before running tests, set up your configuration files:
53
 
54
- ```bash
55
- # Copy example configuration files
56
- cp config.example.yml config.yml
57
- cp models.example.txt models.txt
58
 
59
- # Edit config.yml to set your preferred models and parameters
60
- # Edit models.txt to list models you want to test
61
- ```
62
 
63
- **Important:** The `models.txt` file must end with a blank line.
64
-
65
- ---
66
 
67
- ## Data and Solution
 
68
 
69
- ### ATLAS Open Data Samples
70
- All four data samples and Monte Carlo Higgs→γγ samples (including ttH) from the 2020 ATLAS Open Data diphoton campaign are available at:
71
- ```
72
- /global/cfs/projectdirs/atlas/eligd/llm_for_analysis_copy/data/
73
- ```
74
 
75
- **Important:** If copying data elsewhere, make the directory read-only to prevent LLM-generated code from modifying files:
76
- ```bash
77
- chmod -R a-w /path/to/data/directory
78
- ```
79
 
80
- ### Reference Solution
81
- - Navigate to `solution/` directory and run `python soln.py`
82
- - Use flags: `--step1`, `--step2`, `--step3`, `--plot` to control execution
83
 
84
- ### Reference Arrays for Validation
85
- Large `.npy` reference arrays are not committed to Git (see `.gitignore`).
86
 
87
- **Quick fetch from repo root:**
88
- ```bash
89
- bash scripts/fetch_solution_arrays.sh
90
- ```
91
 
92
- **Or copy from NERSC shared path:**
93
- ```
94
- /global/cfs/projectdirs/atlas/dwkim/llm_test_dev_cxyang/llm_for_analysis/solution/arrays
95
- ```
96
 
97
- ---
98
 
99
- ## Running Tests
100
-
101
- ### Model Configuration
102
-
103
- Three model list files control testing:
104
- - **`models.txt`**: Models for sequential testing
105
- - **`models_supervisor.txt`**: Supervisor models for paired testing
106
- - **`models_coder.txt`**: Coder models for paired testing
107
-
108
- **Important formatting rules:**
109
- - One model per line
110
- - File must end with a blank line
111
- - Repeat model names for multiple trials
112
- - Use CBORG aliases (e.g., `anthropic/claude-sonnet:latest`)
113
-
114
- See `CBORG_MODEL_MAPPINGS.md` for available models and their actual versions.
115
-
116
- ### Testing Workflows
117
-
118
- #### 1. Sequential Testing (Single Model at a Time)
119
- ```bash
120
- bash test_models.sh output_dir_name
121
- ```
122
- Tests all models in `models.txt` sequentially.
123
-
124
- #### 2. Parallel Testing (Multiple Models)
125
- ```bash
126
- # Basic parallel execution
127
- bash test_models_parallel.sh output_dir_name
128
-
129
- # GNU Parallel (recommended for large-scale testing)
130
- bash test_models_parallel_gnu.sh output_dir_name [max_models] [tasks_per_model]
131
-
132
- # Examples:
133
- bash test_models_parallel_gnu.sh experiment1 # Default: 5 models, 5 tasks each
134
- bash test_models_parallel_gnu.sh test 3 5 # 3 models, 5 tasks per model
135
- bash test_models_parallel_gnu.sh large_test 10 5 # 10 models, 5 tasks each
136
- ```
137
-
138
- **GNU Parallel features:**
139
- - Scales to 20-30 models with 200-300 total parallel jobs
140
- - Automatic resource management
141
- - Fast I/O using `/dev/shm` temporary workspace
142
- - Comprehensive error handling and logging
143
-
144
- #### 3. Step-by-Step Testing with Validation
145
- ```bash
146
- # Run all 5 steps with validation
147
- ./run_smk_sequential.sh --validate
148
-
149
- # Run specific steps
150
- ./run_smk_sequential.sh --step2 --step3 --validate --job-id 002
151
-
152
- # Run individual steps
153
- ./run_smk_sequential.sh --step1 --validate # Step 1: Summarize ROOT
154
- ./run_smk_sequential.sh --step2 --validate # Step 2: Create NumPy arrays
155
- ./run_smk_sequential.sh --step3 --validate # Step 3: Preprocess
156
- ./run_smk_sequential.sh --step4 --validate # Step 4: Compute scores
157
- ./run_smk_sequential.sh --step5 --validate # Step 5: Categorization
158
-
159
- # Custom output directory
160
- ./run_smk_sequential.sh --step1 --validate --auto-dir # Creates timestamped dir
161
- ```
162
-
163
- **Directory naming options:**
164
- - `--job-id ID`: Creates `results_job_ID/`
165
- - `--auto-dir`: Creates `results_YYYYMMDD_HHMMSS/`
166
- - `--out-dir DIR`: Custom directory name
167
-
168
- ### Validation
169
-
170
- **Automatic validation (during execution):**
171
- ```bash
172
- ./run_smk_sequential.sh --step1 --step2 --validate
173
- ```
174
- Validation logs saved to `{output_dir}/logs/*_validation.log`
175
-
176
- **Manual validation (after execution):**
177
- ```bash
178
- # Validate all steps
179
- python check_soln.py --out_dir results_job_002
180
-
181
- # Validate specific step
182
- python check_soln.py --out_dir results_job_002 --step 2
183
- ```
184
-
185
- **Validation features:**
186
- - βœ… Adaptive tolerance with 4 significant digit precision
187
- - πŸ“Š Column-by-column difference analysis
188
- - πŸ“‹ Side-by-side value comparison
189
- - 🎯 Clear, actionable error messages
190
-
191
- ### Speed Optimization
192
-
193
- Reduce iteration counts in `config.yml`:
194
- ```yaml
195
- # Limit LLM coder attempts (default 10)
196
- max_iterations: 3
197
- ```
198
 
199
- ---
200
 
201
- ## Analysis and Visualization
202
-
203
- ### Results Summary
204
- All test results are aggregated in:
205
- ```
206
- results_summary.csv
207
- ```
208
-
209
- **Columns include:** supervisor, coder, step, success, iterations, duration, API_calls, tokens, errors, error_descriptions
210
-
211
- ### Error Analysis and Categorization
212
-
213
- **Automated error analysis:**
214
- ```bash
215
- python error_analysis.py --results_dirs <dir1> <dir2> ... --output results_summary.csv --model <model_name>
216
- ```
217
-
218
- Uses LLM to analyze comprehensive logs and categorize errors into:
219
- - Semantic errors
220
- - Function-calling errors
221
- - Intermediate file not found
222
- - Incorrect branch name
223
- - OpenAI API errors
224
- - Data quality issues (all weights = 0)
225
- - Other/uncategorized
226
-
227
- ### Interactive Analysis Notebooks
228
-
229
- #### 1. Five-Step Performance Analysis (`five_step_analysis.ipynb`)
230
- Comprehensive analysis of model performance across all 5 workflow steps:
231
- - **Success rate heatmap** (models Γ— steps)
232
- - **Agent work progression** (iterations over steps)
233
- - **API call statistics** (by step and model)
234
- - **Cost analysis** (input/output tokens, estimated pricing)
235
-
236
- **Output plots:**
237
- - `plots/1_success_rate_heatmap.pdf`
238
- - `plots/2_agent_work_line_plot.pdf`
239
- - `plots/3_api_calls_line_plot.pdf`
240
- - `plots/4_cost_per_step.pdf`
241
- - `plots/five_step_summary_stats.csv`
242
-
243
- #### 2. Error Category Analysis (`error_analysis.ipynb`)
244
- Deep dive into error patterns and failure modes:
245
- - **Normalized error distribution** (stacked bar chart with percentages)
246
- - **Error type heatmap** (models Γ— error categories)
247
- - **Top model breakdowns** (faceted plots for top 9 models)
248
- - **Error trends across steps** (stacked area chart)
249
-
250
- **Output plots:**
251
- - `plots/error_distribution_by_model.pdf`
252
- - `plots/error_heatmap_by_model.pdf`
253
- - `plots/error_categories_top_models.pdf`
254
- - `plots/errors_by_step.pdf`
255
-
256
- #### 3. Quick Statistics (`plot_stats.ipynb`)
257
- Legacy notebook for basic statistics visualization.
258
-
259
- ### Log Interpretation
260
-
261
- **Automated log analysis:**
262
- ```bash
263
- python logs_interpreter.py --log_dir <output_dir> --model lbl/cborg-deepthought --output analysis.txt
264
- ```
265
-
266
- Analyzes comprehensive supervisor-coder logs to identify:
267
- - Root causes of failures
268
- - Responsible parties (user, supervisor, coder, external)
269
- - Error patterns across iterations
270
 
271
- ---
272
 
273
- ## Project Structure
274
-
275
- ### Core Scripts
276
- - **`supervisor_coder.py`**: Supervisor-coder framework implementation
277
- - **`check_soln.py`**: Solution validation with enhanced comparison
278
- - **`write_prompt.py`**: Prompt management and context chaining
279
- - **`update_stats.py`**: Statistics tracking and CSV updates
280
- - **`error_analysis.py`**: LLM-powered error categorization
281
-
282
- ### Test Runners
283
- - **`test_models.sh`**: Sequential model testing
284
- - **`test_models_parallel.sh`**: Parallel testing (basic)
285
- - **`test_models_parallel_gnu.sh`**: GNU Parallel testing (recommended)
286
- - **`test_stats.sh`**: Individual model statistics
287
- - **`test_stats_parallel.sh`**: Parallel step execution
288
- - **`run_smk_sequential.sh`**: Step-by-step workflow runner
289
-
290
- ### Snakemake Workflows (`workflow/`)
291
- The analysis workflow is divided into 5 sequential steps:
292
-
293
- 1. **`summarize_root.smk`**: Extract ROOT file structure and branch information
294
- 2. **`create_numpy.smk`**: Convert ROOT β†’ NumPy arrays
295
- 3. **`preprocess.smk`**: Apply preprocessing transformations
296
- 4. **`scores.smk`**: Compute signal/background classification scores
297
- 5. **`categorization.smk`**: Final categorization and statistical analysis
298
-
299
- **Note:** Later steps use solution outputs to enable testing even when earlier steps fail.
300
-
301
- ### Prompts (`prompts/`)
302
- - `summarize_root.txt`: Step 1 task description
303
- - `create_numpy.txt`: Step 2 task description
304
- - `preprocess.txt`: Step 3 task description
305
- - `scores.txt`: Step 4 task description
306
- - `categorization.txt`: Step 5 task description
307
- - `supervisor_first_call.txt`: Initial supervisor instructions
308
- - `supervisor_call.txt`: Subsequent supervisor instructions
309
-
310
- ### Utility Scripts (`util/`)
311
- - **`inspect_root.py`**: ROOT file inspection tools
312
- - **`analyze_particles.py`**: Particle-level analysis
313
- - **`compare_arrays.py`**: NumPy array comparison utilities
314
-
315
- ### Model Documentation
316
- - **`CBORG_MODEL_MAPPINGS.md`**: CBORG alias β†’ actual model mappings
317
- - **`COMPLETE_MODEL_VERSIONS.md`**: Full version information for all tested models
318
- - **`MODEL_NAME_UPDATES.md`**: Model name standardization notes
319
- - **`O3_MODEL_COMPARISON.md`**: OpenAI O3 model variant comparison
320
-
321
- ### Analysis Notebooks
322
- - **`five_step_analysis.ipynb`**: Comprehensive 5-step performance analysis
323
- - **`error_analysis.ipynb`**: Error categorization and pattern analysis
324
- - **`error_analysis_plotting.ipynb`**: Additional error visualizations
325
- - **`plot_stats.ipynb`**: Legacy statistics plots
326
-
327
- ### Output Structure
328
- Each test run creates:
329
- ```
330
- output_name/
331
- β”œβ”€β”€ model_timestamp/
332
- β”‚ β”œβ”€β”€ generated_code/ # LLM-generated Python scripts
333
- β”‚ β”œβ”€β”€ logs/ # Execution logs and supervisor records
334
- β”‚ β”œβ”€β”€ arrays/ # NumPy arrays produced by generated code
335
- β”‚ β”œβ”€β”€ plots/ # Comparison plots (generated vs. solution)
336
- β”‚ β”œβ”€β”€ prompt_pairs/ # User + supervisor prompts
337
- β”‚ β”œβ”€β”€ results/ # Temporary ROOT files (job-scoped)
338
- β”‚ └── snakemake_log/ # Snakemake execution logs
339
- ```
340
-
341
- **Job-scoped ROOT outputs:**
342
- - Step 5 uses temporary ROOT files (`signal.root`, `bkgd.root`)
343
- - Written to `${OUTPUT_DIR}/results/` to prevent cross-run interference
344
- - Automatically cleaned after significance calculation
345
 
346
- ---
347
 
348
- ## Advanced Usage
 
 
349
 
350
- ### Supervisor-Coder Configuration
351
 
352
- Control iteration limits in `config.yml`:
353
- ```yaml
354
- model: 'anthropic/claude-sonnet:latest'
355
- name: 'experiment_name'
356
- out_dir: 'results/experiment_name'
357
- max_iterations: 10 # Maximum supervisor-coder iterations per step
358
- ```
359
 
360
- ### Parallel Execution Tuning
361
 
362
- For `test_models_parallel_gnu.sh`:
363
- ```bash
364
- # Syntax:
365
- bash test_models_parallel_gnu.sh <output> <max_models> <tasks_per_model>
 
366
 
367
- # Conservative (safe for shared systems):
368
- bash test_models_parallel_gnu.sh test 3 5 # 15 total jobs
369
 
370
- # Aggressive (dedicated nodes):
371
- bash test_models_parallel_gnu.sh test 10 10 # 100 total jobs
372
- ```
 
 
 
 
 
373
 
374
- ### Custom Validation
375
 
376
- Run validation on specific steps or with custom tolerances:
377
- ```bash
378
- # Validate only data conversion step
379
- python check_soln.py --out_dir results/ --step 2
380
 
381
- # Check multiple specific steps
382
- python check_soln.py --out_dir results/ --step 2 --step 3 --step 4
383
- ```
384
 
385
- ### Log Analysis Pipeline
386
 
387
- ```bash
388
- # 1. Run tests
389
- bash test_models_parallel_gnu.sh experiment1 5 5
390
 
391
- # 2. Analyze logs with LLM
392
- python logs_interpreter.py --log_dir experiment1/model_timestamp/ --output analysis.txt
393
 
394
- # 3. Categorize errors
395
- python error_analysis.py --results_dirs experiment1/*/ --output summary.csv
396
 
397
- # 4. Generate visualizations
398
- jupyter notebook error_analysis.ipynb
399
- ```
400
 
401
- ---
402
 
403
- ## Roadmap and Future Directions
404
 
405
- ### Planned Improvements
406
 
407
- **Prompt Engineering:**
408
- - Auto-load context (file lists, logs) at step start
409
- - Provide comprehensive inputs/outputs/summaries upfront
410
- - Develop prompt-management layer for cross-analysis reuse
411
 
412
- **Validation & Monitoring:**
413
- - Embed validation in workflows for immediate error detection
414
- - Record input/output and state transitions for reproducibility
415
- - Enhanced situation awareness through comprehensive logging
416
-
417
- **Multi-Analysis Extension:**
418
- - Rerun H→γγ with improved system prompts
419
- - Extend to H→4ℓ and other Higgs+X channels
420
- - Provide learned materials from previous analyses as reference
421
-
422
- **Self-Improvement:**
423
- - Reinforcement learning–style feedback loops
424
- - Agent-driven prompt refinement
425
- - Automatic generalization across HEP analyses
426
 
427
  ---
428
 
429
- ## Citation and Acknowledgments
430
-
431
- This framework tests LLM agents on ATLAS Open Data from:
432
- - 2020 ATLAS Open Data diphoton samples: https://opendata.cern.ch/record/15006
433
-
434
- Models tested via CBORG API (Lawrence Berkeley National Laboratory).
435
-
436
- ---
437
-
438
- ## Support and Contributing
439
-
440
- For questions or issues:
441
- 1. Check existing documentation in `*.md` files
442
- 2. Review example configurations in `config.yml`
443
- 3. Examine validation logs in output directories
444
-
445
- For contributions, please ensure:
446
- - Model lists end with blank lines
447
- - Prompts follow established format
448
- - Validation passes for all test cases
 
1
+ ## Abstract
2
 
3
+ We present a proof-of-principle study demonstrating the use of large language model (LLM) agents to automate a representative high energy physics (HEP) analysis. Using the Higgs boson diphoton cross-section measurement as a case study with ATLAS Open Data, we design a hybrid system that combines an LLM-based supervisor–coder agent with the `Snakemake` workflow manager.
4
 
5
+ In this architecture, the workflow manager enforces reproducibility and determinism, while the agent autonomously generates, executes, and iteratively corrects analysis code in response to user instructions. We define quantitative evaluation metrics (success rate and error distribution) to assess agent performance across multi-stage workflows.
6
 
7
+ To characterize variability across architectures, we benchmark a representative selection of state-of-the-art LLMs, spanning the *Gemini* and *GPT-5* series, the *Claude* family, and leading open-weight models. While the workflow manager ensures deterministic execution of all analysis steps, the final outputs still show stochastic variation. Although we set the temperature to zero, other sampling parameters (e.g., top-p, top-k) remained at their defaults, and some reasoning-oriented models internally adjust these settings. Consequently, the models do not produce fully deterministic results.
 
 
 
 
8
 
9
+ This study establishes the first LLM-agent-driven automated data-analysis framework in HEP, enabling systematic benchmarking of model capabilities, stability, and limitations in real-world scientific computing environments.
10
 
11
+ The baseline code used in this work is [available here](https://huggingface.co/HWresearch/LLM4HEP/tree/main).
12
 
13
+ ## Introduction
 
 
14
 
15
+ While large language models (LLMs) and agentic systems have been explored for automating components of scientific discovery in fields such as biology (e.g., CellAgent for single-cell RNA-seq analysis [Xiao et al., 2024](#ref-xiao-et-al-2024)) and software engineering (e.g., LangGraph-based bug fixing agents [Wang & Duan, 2025](#ref-wang-duan-2025)), their application to high energy physics (HEP) remains largely unexplored. In HEP, analysis pipelines are highly structured, resource-intensive, and have strict requirements of reproducibility and statistical rigor. Existing work in HEP has primarily focused on machine learning models for event classification or simulation acceleration, and more recently on domain-adapted LLMs (e.g., FeynTune [Richmond et al., 2025](#ref-richmond-et-al-2025), Astro-HEP-BERT [Simons, 2024](#ref-simons-2024), BBT-Neutron [Wu et al., 2024](#ref-wu-et-al-2024)) and conceptual roadmaps for large physics models. However, to the best of our knowledge, no published study has demonstrated an agentic system that autonomously generates, executes, validates, and iterates on HEP data analysis workflows.
 
 
 
16
 
17
+ In this work, we present the first attempt to operationalize LLM-based agents within a reproducible HEP analysis pipeline. Our approach integrates a task-focused agent with the `Snakemake` workflow manager [Mölder et al., 2021](#ref-molder-et-al-2021), leveraging `Snakemake`'s HPC-native execution and file-based provenance to enforce determinism while delegating bounded subtasks (e.g., event selection, code generation, validation, and self-correction) to an LLM agent. Unlike prior agent frameworks such as LangChain [Chase, 2022](#ref-chase-2022) or LangGraph [LangChain Team, 2025](#ref-langchain-team-2025), which emphasize flexible multi-step reasoning, this design embeds the agent within a domain-constrained directed acyclic graph (DAG), ensuring both scientific reliability and AI relevance.
18
 
19
+ ## Description of a Representative High Energy Physics Analysis
 
 
 
 
20
 
21
+ In this paper, we use a cross-section measurement of the Higgs boson decaying to the diphoton channel at the Large Hadron Collider as an example. We employ collision data and simulation samples from the 2020 ATLAS Open Data release [ATLAS Collaboration, 2020](#ref-atlas-collab-2020). The data sample corresponds to **10 fb⁻¹** of proton-proton collisions collected in 2016 at $\sqrt{s} = 13$ TeV. Higgs boson production is simulated for gluon fusion, vector boson fusion, associated production with a vector boson, and associated production with a top-quark pair.
 
 
 
22
 
23
+ Our example analysis uses control samples derived from collision data and Higgs production simulation samples to design a categorized analysis with a machine-learning-based event classifier, similar to the one used by the CMS collaboration in their Higgs observation paper [CMS Collaboration, 2012](#ref-cms-collab-2012). The objective of this analysis is to maximize the expected significance of the Higgs-to-diphoton signal. Since the purpose of this work is to demonstrate the LLM-based agentic implementation of a HEP analysis, we do not report the observed significance. The workflow of this analysis is highly representative of HEP analyses.
24
 
25
+ The technical implementation of the analysis workflow is factorized into five sequential steps, executed through the `Snakemake` workflow management system. Each step is designed to test a distinct type of reasoning or code-generation capability, and the evaluation of each step is performed independently, i.e., the success of a given step does not depend on the completion or correctness of the previous one. This design allows for consistent benchmarking across heterogeneous tasks while maintaining deterministic workflow execution.
26
 
27
+ **Step 1 (ROOT file inspection):**
28
+ Generates summary text files describing the ROOT files and their internal structure, including trees and branches, to provide the agent with a human-readable overview of the available data.
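
+ As a concrete illustration, such a summary could be produced as follows (a minimal sketch assuming the `uproot` library; the generated scripts and the file paths, which are placeholders here, vary from trial to trial):

+ ```python
+ import uproot
+ 
+ # Sketch: dump the tree/branch structure of one input file to a text summary.
+ # "data_A.root" and "root_summary.txt" are illustrative placeholder paths.
+ with uproot.open("data_A.root") as f, open("root_summary.txt", "w") as out:
+     for name, classname in f.classnames().items():
+         out.write(f"{name}: {classname}\n")
+         if classname == "TTree":
+             for branch in f[name].keys():
+                 out.write(f"    {branch}\n")
+ ```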
 
 
29
 
30
+ **Step 2 (Ntuple conversion):**
31
+ Produces a Python script that reads all particle observables specified in the user prompt from the `TTree` objects in ROOT files [Antcheva et al., 2009](#ref-antcheva-et-al-2009) and converts them into `numpy` arrays for downstream analysis.
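
+ A minimal sketch of this conversion, again assuming `uproot` and using purely illustrative tree and branch names (the real observables are those specified in the user prompt):

+ ```python
+ import numpy as np
+ import uproot
+ 
+ # Illustrative only: the tree name "mini" and the branch names are placeholders,
+ # and the branches are assumed to be flat (one value per event).
+ branches = ["photon_pt_lead", "photon_pt_sublead", "photon_eta_lead", "m_yy"]
+ with uproot.open("mc_ggH_yy.root") as f:
+     arrays = f["mini"].arrays(branches, library="np")
+ np.save("signal.npy", np.column_stack([arrays[b] for b in branches]))
+ ```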
 
32
 
33
+ **Step 3 (Preprocessing):**
34
+ Normalizes the signal and background arrays and applies standard selection criteria to prepare the datasets for machine-learning–based classification.
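
+ For illustration, a preprocessing sketch of this kind (the column index, cut value, and normalization scheme are placeholders, not the analysis selection):

+ ```python
+ import numpy as np
+ 
+ def preprocess(arr):
+     # Placeholder selection: keep events with leading-photon pT above 25 GeV
+     # (column 0 is assumed to hold that observable in this sketch).
+     sel = arr[arr[:, 0] > 25.0]
+     # Standardize each column to zero mean and unit variance.
+     mean, std = sel.mean(axis=0), sel.std(axis=0)
+     return (sel - mean) / np.where(std > 0, std, 1.0)
+ ```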
 
35
 
36
+ **Step 4 (S-B separation):**
37
+ Applies TabPFN [Hollmann et al., 2025](#ref-hollmann-et-al-2025), a transformer-based foundation model for tabular data, to perform signal–background separation. The workflow requires a script that calls a provided function to train and evaluate the TabPFN model using the appropriate datasets and hyperparameters.
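
+ The core of this step reduces to a call of roughly the following form (a sketch with default settings; the workflow supplies its own wrapper function, datasets, and hyperparameters):

+ ```python
+ from tabpfn import TabPFNClassifier
+ 
+ def train_and_score(X_train, y_train, X_eval):
+     # y_train labels signal (1) vs. background (0); X_* are the Step 3 arrays.
+     clf = TabPFNClassifier()   # default settings; the workflow passes its own
+     clf.fit(X_train, y_train)
+     return clf.predict_proba(X_eval)[:, 1]   # per-event signal score in [0, 1]
+ ```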
38
 
39
+ **Step 5 (Categorization):**
40
+ Performs statistical categorization of events by defining optimized boundaries on the TabPFN score to maximize the expected significance of the Higgs-to-diphoton signal. This step is implemented as an iterative function-calling procedure, where new category boundaries are added until the improvement in expected significance falls below 5%.
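
+ A sketch of the underlying significance computation is shown below; the per-category formula is a common Asimov-style approximation and the helper names are illustrative, so the framework's provided significance function may differ in detail.

+ ```python
+ import numpy as np
+ 
+ def category_significance(s, b):
+     # Asimov-style approximation for a single category (a sketch, not
+     # necessarily the framework's provided significance function).
+     return np.sqrt(2.0 * ((s + b) * np.log(1.0 + s / b) - s)) if b > 0 else 0.0
+ 
+ def combined_significance(score_s, w_s, score_b, w_b, boundaries):
+     # TabPFN scores lie in [0, 1]; np.histogram's last bin is right-inclusive.
+     edges = np.concatenate(([0.0], np.sort(boundaries), [1.0]))
+     s_yields, _ = np.histogram(score_s, bins=edges, weights=w_s)
+     b_yields, _ = np.histogram(score_b, bins=edges, weights=w_b)
+     return np.sqrt(sum(category_significance(s, b) ** 2
+                        for s, b in zip(s_yields, b_yields)))
+ ```

+ New boundaries would then be added greedily, stopping once the relative improvement in the combined significance falls below the 5% threshold described above.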
 
 
 
41
 
42
+ ## Architecture
 
 
 
43
 
44
+ We adopt a hybrid approach to automate the Higgs boson to diphoton data analysis. Given the relatively fixed workflow, we use `Snakemake` to orchestrate the sequential execution of analysis steps. A supervisor–coder agent is deployed to complete each step. This design balances the determinism of the analysis structure with the flexibility often required in HEP analyses.
 
 
45
 
46
+ ### Workflow management
 
47
 
48
+ `Snakemake` is a Python-based workflow management system that enables researchers to define computational workflows through rule-based specifications, where each rule describes how to generate specific output files from input files using defined scripts or commands. In this package, `Snakemake` serves as the orchestration backbone: it manages the dependencies and execution order of the five-stage analysis workflow for ATLAS diphoton data processing, as sketched below. This modular approach allows the complex physics analysis to be broken down into manageable, interdependent components that can be executed efficiently and reproducibly.
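
+ Schematically, driving the five rule files in order comes down to something like the following (a sketch only; the repository's runner scripts, documented in `SETUP.md`, wrap this with validation, logging, and output-directory handling):

+ ```python
+ import subprocess
+ 
+ # Sketch: run the five workflow stages in order through the snakemake CLI.
+ STEPS = [
+     "workflow/summarize_root.smk",
+     "workflow/create_numpy.smk",
+     "workflow/preprocess.smk",
+     "workflow/scores.smk",
+     "workflow/categorization.smk",
+ ]
+ for smk in STEPS:
+     subprocess.run(["snakemake", "--snakefile", smk, "--cores", "1"], check=True)
+ ```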
 
 
 
49
 
50
+ ### Supervisor–coder agent
 
 
 
51
 
52
+ We design a supervisor–coder agent to carry out each task, as illustrated in Fig. 1. The supervisor and coder are implemented as API calls to a large language model (LLM), with distinct system prompts tailored to their respective roles. The supervisor receives instructions from the human user, formulates corresponding directives for the coder, and reviews the coder's output. The coder, in turn, takes the supervisor's instructions and generates code to implement the required action. The generated code is executed through an execution engine, which records the execution trace.
53
 
54
+ The supervisor and coder roles are defined by their differing access to state, memory, and system instructions. In the reference configuration, both roles are implemented using the `gemini-pro-2.5` model [Google, 2025](#ref-google-2025); however, the same architecture is applied to a range of contemporary LLMs to evaluate model-dependent performance and variability.
55
 
56
+ <!-- ![Illustration of internal workflow for the supervisor–coder agent.](supervisor_coder.pdf) -->
57
 
58
+ Each agent interaction is executed through API calls to the LLM. Although we set the temperature to 0, other sampling parameters (e.g., top-p, top-k) remained at their default values, and some thinking-oriented models internally raise the effective temperature. As a result, the outputs exhibit minor stochastic variation even under identical inputs. Each call includes a user instruction, a system prompt, and auxiliary metadata for tracking errors and execution records.
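
+ Concretely, a single role-specific interaction reduces to an OpenAI-compatible chat-completion call of roughly the following form (a sketch: the endpoint URL, model alias, and prompt strings are placeholders, and the real framework additionally records tokens, errors, and execution metadata):

+ ```python
+ import os
+ from openai import OpenAI
+ 
+ client = OpenAI(
+     api_key=os.environ["CBORG_API_KEY"],
+     base_url="https://api.cborg.lbl.gov",  # placeholder endpoint
+ )
+ 
+ coder_system_prompt = "You are the coder agent. Return only a Python script."  # abbreviated placeholder
+ supervisor_instruction = "Write a script that converts the ROOT ntuples to numpy arrays."  # abbreviated placeholder
+ 
+ response = client.chat.completions.create(
+     model="google/gemini-pro",   # CBORG-style alias, illustrative
+     temperature=0,               # other sampling parameters stay at their defaults
+     messages=[
+         {"role": "system", "content": coder_system_prompt},
+         {"role": "user", "content": supervisor_instruction},
+     ],
+ )
+ generated_code = response.choices[0].message.content
+ ```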
59
 
60
+ For the initial user interaction with the supervisor, the input prompt includes a natural-language description of the task objectives, suggested implementation strategies, and a system prompt that constrains the behavior of the model output. The result of this interaction is an instruction passed to the coder, which in turn generates a Python script to be executed. If execution produces an error, the error message, the original supervisor instruction, the generated code, and a debugging system prompt are passed back to the supervisor in a follow-up API call. The supervisor then issues revised instructions to the coder to address the problem. This self-correction loop is repeated up to three times before the trial is deemed unsuccessful.
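
+ The control flow of a single trial can be summarized as follows (a schematic sketch: the helper callables and their signatures are illustrative, not the framework's actual API):

+ ```python
+ def run_step(task_prompt, ask_supervisor, ask_coder, run_script, max_iterations=3):
+     # Schematic control flow of one trial. ask_supervisor/ask_coder wrap the
+     # role-specific LLM calls; run_script executes the generated code and
+     # returns (ok, error_message). Names and signatures are illustrative.
+     instruction = ask_supervisor(task_prompt)
+     for _ in range(max_iterations):
+         code = ask_coder(instruction)
+         ok, error = run_script(code)
+         if ok:
+             return True
+         # On failure, hand the error message, the previous instruction, and the
+         # failing code back to the supervisor for a revised instruction.
+         instruction = ask_supervisor(task_prompt, previous=instruction,
+                                      code=code, error=error)
+     return False   # trial deemed unsuccessful after max_iterations attempts
+ ```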
61
 
62
+ ## Results
63
 
64
+ To establish a baseline, we conducted 219 experiments with the `gemini-pro-2.5` model, providing high statistical precision across all analysis stages. In this configuration, the workflow was organized into three composite steps:
65
 
66
+ 1. **Data preparation** including *ntuple creation* and *preprocessing*
67
+ 2. **Signal–background (S-B) separation**
68
+ 3. **Categorization** based on the expected Higgs-to-diphoton significance
69
 
70
+ Up to five self-correction iterations were applied. The corresponding success rates for these three steps were **58 ± 3%**, **88 ± 2%**, and **74 ± 3%**, respectively.
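
+ For orientation, the quoted uncertainties are consistent with simple binomial counting errors on 219 trials per step (a back-of-envelope check, not necessarily the uncertainty treatment used in the study):

+ ```python
+ import math
+ 
+ n = 219
+ for p in (0.58, 0.88, 0.74):
+     sigma = math.sqrt(p * (1.0 - p) / n)   # binomial standard error
+     print(f"{p:.0%} +/- {sigma:.0%}")      # -> 58% +/- 3%, 88% +/- 2%, 74% +/- 3%
+ ```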
71
 
72
+ As summarized in Table 1, the **data-preparation** stage (combining ntuple production and preprocessing) was the most error-prone, with 93 failures out of 219 trials. Most issues stemmed from insufficient context for identifying objects within ROOT files. The dominant failure modes were Type 1 (zero event weights) and Type 5 (missing intermediate files), indicating persistent challenges in reasoning about file structures, dependencies, and runtime organization. Providing richer execution context such as package lists, file hierarchies, and metadata could help mitigate these problems. Despite these limitations, the completion rate of this stage (~57%) demonstrates that LLMs can autonomously execute complex domain-specific programming tasks a non-trivial fraction of the time.
73
 
74
+ The subsequent **S-B separation** using `TabPFN` and the final **categorization** for the Higgs-to-diphoton measurement exhibited substantially fewer failures (25 and 57, respectively). These errors were primarily Type 3 (function-calling) and Type 6 (semantic) issues, reflecting improved stability once the workflow reaches the model-training and evaluation stages.
75
 
76
+ | **Step** | **Type 1** | **Type 2** | **Type 3** | **Type 4** | **Type 5** | **Type 6** | **Type 7** |
77
+ |---------------------------|:----------:|:----------:|:----------:|:----------:|:----------:|:----------:|:----------:|
78
+ | Data-preparation (93 failures) | 41 | 15 | 1 | 9 | 17 | 6 | 4 |
79
+ | S-B separation (25 failures) | 0 | 4 | 3 | 0 | 13 | 5 | 0 |
80
+ | Categorization (57 failures) | 0 | 4 | 26 | 0 | 6 | 17 | 4 |
81
 
82
+ *Table 1. Distribution of failure counts by error category for each workflow step, based on 219 trials per step.*
 
83
 
84
+ *Error type definitions:*
85
+ Type 1: all data weights = 0
86
+ Type 2: dummy data created
87
+ Type 3: function-calling error
88
+ Type 4: incorrect branch name
89
+ Type 5: intermediate file not found
90
+ Type 6: semantic error
91
+ Type 7: other
92
 
93
+ To assess agent efficiency, we measured the ratio of user-input tokens to the total tokens exchanged with the API. For the `gemini-pro-2.5` baseline, this ratio was **(1.65 ± 0.15) × 10⁻³** for **data preparation**, **(1.43 ± 0.10) × 10⁻³** for **S-B separation**, and **(0.93 ± 0.07) × 10⁻³** for **categorization**. A higher ratio indicates more efficient task execution, as fewer internally generated tokens are required to complete the same instruction set.
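
+ The metric itself is a simple fraction of the per-trial token accounting (the field names below are illustrative, not the framework's actual log schema):

+ ```python
+ def user_token_ratio(user_input_tokens, total_tokens):
+     # Fraction of all exchanged tokens that came directly from the user prompt.
+     return user_input_tokens / total_tokens
+ 
+ # e.g. a ratio of 1.65e-3 means only ~0.17% of the traffic is user input;
+ # the remainder is generated by the supervisor-coder loop itself.
+ print(user_token_ratio(1_650, 1_000_000))
+ ```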
94
 
95
+ Over 98% of tokens originated from the model's autonomous reasoning and self-correction rather than direct user input, suggesting that additional task detail would minimally affect overall token cost. This ratio thus provides a simple diagnostic of communication efficiency and reasoning compactness.
 
 
 
96
 
97
+ Following the initial benchmark with `gemini-pro-2.5`, we expanded the study to include additional models, such as `openai-gpt-5` [OpenAI, 2025](#ref-openai-2025), `claude-3.5` [Anthropic, 2024](#ref-anthropic-2024), `qwen-3` [Alibaba, 2025](#ref-alibaba-2025), and the open-weight `gpt-oss-120b` [OpenAI et al., 2025](#ref-openai-et-al-2025), evaluated under the same agentic workflow. Based on early observations, the prompts for the **data preparation** stage were refined and divided into three subtasks: **ROOT file inspection**, **ntuple conversion**, and **preprocessing** (signal and background region selection). Input file locations were also made explicit to the agent to ensure deterministic resolution of data paths and reduce reliance on implicit context.
 
 
98
 
99
+ <!-- Insert figure here. -->
100
 
101
+ The results across models, summarized in Figures 1 and 2, show consistent qualitative behavior with the baseline while highlighting quantitative differences in reliability, efficiency, and error patterns. For the `gemini-pro-2.5` model, the large number of repeated trials (219 total) provides a statistically robust characterization of performance across steps. For the other models, each tested with approximately ten trials per step, the smaller sample size limits statistical interpretation, and the results should therefore be regarded as qualitative indicators of behavior rather than precise performance estimates. Nonetheless, the observed cross-model consistency and similar failure patterns suggest that the workflow and evaluation metrics are sufficiently general to support larger-scale future benchmarks. This pilot-level comparison thus establishes both the feasibility and reproducibility of the agentic workflow across distinct LLM architectures.
 
 
102
 
103
+ Figure 1 summarizes the cross-model performance of the agentic workflow. The heatmap shows the success rate for each model–step pair, highlighting substantial variation in reliability across architectures. The dates appended to each model name denote the model release or update version used for testing, while the parameter in parentheses (e.g., "17B") indicates the model's approximate parameter count in billions. Models in the `GPT-5` [OpenAI, 2025](#ref-openai-2025) and `Gemini` [Google, 2025](#ref-google-2025) families achieve the highest completion fractions across most steps, whereas smaller or open-weight models such as `gpt-oss-120b` [OpenAI et al., 2025](#ref-openai-et-al-2025) exhibit lower but still non-negligible success on certain steps. This demonstrates that the workflow can be executed end-to-end by multiple LLMs, though with markedly different robustness and learning stability.
 
104
 
105
+ Figure 2 shows the distribution of failure modes across all analysis stages, considering only trials that did not reach a successful completion. The categorization is based on the LLM-generated outputs themselves, capturing how each model typically fails. Error types include function-calling issues, missing or placeholder data, infrastructure problems (e.g., API or file-handling failures), and semantic errors (cases where the model misinterprets the prompt and produces runnable but incorrect code), as well as syntax mistakes.
 
106
 
107
+ Clear model-dependent patterns emerge. Models with higher success rates, such as the `GPT-5` [OpenAI, 2025](#ref-openai-2025) and `Gemini 2.5` [Google, 2025](#ref-google-2025) families, show fewer logic and syntax issues, while smaller or open-weight models exhibit more semantic and data-handling failures. These differences reveal characteristic failure signatures that complement overall completion rates.
 
 
108
 
109
+ Taken together, Figures 1 and 2 highlight two main aspects of LLM-agent behavior: overall task reliability and the characteristic error modes behind failed executions. These results show that models differ not only in success rate but also in the nature of their failures, providing insight into their robustness and limitations in realistic HEP workflows.
110
 
111
+ ## Limitations
112
 
113
+ This study shows that LLMs can support HEP data analysis workflows by interpreting natural language, generating executable code, and applying basic self-correction. While multi-step task planning is beyond the current scope, the `Snakemake` integration provides a natural path toward rule-based agent planning. Future work will pursue this direction and further strengthen the framework through improvements in prompting, agent design, domain adaptation, and retrieval-augmented generation.
114
 
115
+ ## Conclusion
 
 
 
116
 
117
+ We demonstrated the feasibility of employing LLM agents to automate components of high-energy physics (HEP) data analysis within a reproducible workflow. The proposed hybrid framework combines LLM-based task execution with deterministic workflow management, enabling near end-to-end analyses with limited human oversight. While the `gemini-pro-2.5` model [Google, 2025](#ref-google-2025) served as a statistically stable reference, the broader cross-model evaluation shows that several contemporary LLMs can perform complex HEP tasks with distinct levels of reliability and robustness. Beyond this proof of concept, the framework provides a foundation for systematically assessing and improving LLM-agent performance in domain-specific scientific workflows.
118
 
119
  ---
120
 
121
+ ## References
122
+
123
+ - <span id="ref-atlas-collab-2020"></span> **ATLAS Collaboration.** *ATLAS simulated samples collection for jet reconstruction training, as part of the 2020 Open Data release.* CERN Open Data Portal (2020). [doi:10.7483/OPENDATA.ATLAS.L806.5CKU](http://doi.org/10.7483/OPENDATA.ATLAS.L806.5CKU)
124
+ - <span id="ref-cms-collab-2012"></span> **CMS Collaboration.** *Observation of a New Boson at a Mass of 125 GeV with the CMS Experiment at the LHC.* Phys. Lett. B 716, 30–61 (2012). [arXiv:1207.7235](https://arxiv.org/abs/1207.7235), [doi:10.1016/j.physletb.2012.08.021](https://doi.org/10.1016/j.physletb.2012.08.021)
125
+ - <span id="ref-antcheva-et-al-2009"></span> **Antcheva et al.** *ROOT: A C++ framework for petabyte data storage, statistical analysis and visualization.* Comput. Phys. Commun. 180, 2499–2512 (2009). [doi:10.1016/j.cpc.2009.08.005](https://doi.org/10.1016/j.cpc.2009.08.005)
126
+ - <span id="ref-hollmann-et-al-2025"></span> **Hollmann et al.** *Accurate predictions on small data with a tabular foundation model.* Nature 637, 319–326 (2025). [doi:10.1038/s41586-024-08328-6](https://doi.org/10.1038/s41586-024-08328-6)
127
+ - <span id="ref-xiao-et-al-2024"></span> **Xiao et al.** *CellAgent: An LLM-driven Multi-Agent Framework for Automated Single-cell Data Analysis.* arXiv:2407.09811 (2024). [https://arxiv.org/abs/2407.09811](https://arxiv.org/abs/2407.09811)
128
+ - <span id="ref-wang-duan-2025"></span> **Wang & Duan.** *Empirical Research on Utilizing LLM-based Agents for Automated Bug Fixing via LangGraph.* arXiv:2502.18465 (2025). [https://arxiv.org/abs/2502.18465](https://arxiv.org/abs/2502.18465)
129
+ - <span id="ref-richmond-et-al-2025"></span> **Richmond et al.** *FeynTune: Large Language Models for High-Energy Theory.* arXiv:2508.03716 (2025). [https://arxiv.org/abs/2508.03716](https://arxiv.org/abs/2508.03716)
130
+ - <span id="ref-simons-2024"></span> **Simons.** *Astro-HEP-BERT: A bidirectional language model for studying the meanings of concepts...* arXiv:2411.14877 (2024). [https://arxiv.org/abs/2411.14877](https://arxiv.org/abs/2411.14877)
131
+ - <span id="ref-wu-et-al-2024"></span> **Wu et al.** *Scaling Particle Collision Data Analysis.* arXiv:2412.00129 (2024). [https://arxiv.org/abs/2412.00129](https://arxiv.org/abs/2412.00129)
132
+ - <span id="ref-molder-et-al-2021"></span> **Mölder et al.** *Sustainable data analysis with Snakemake.* F1000Research 10, 33 (2021). [doi:10.12688/f1000research.29032.2](https://doi.org/10.12688/f1000research.29032.2)
133
+ - <span id="ref-chase-2022"></span> **Chase.** *LangChain: A Framework for Large Language Model Applications.* (2022). [https://langchain.com/](https://langchain.com/)
134
+ - <span id="ref-langchain-team-2025"></span> **LangChain Team.** *LangGraph: A Graph-Based Agent Workflow Framework.* (2025). [https://langchain-ai.github.io/langgraph/](https://langchain-ai.github.io/langgraph/)
135
+ - <span id="ref-google-2025"></span> **Google.** Gemini 2.5 Pro. (2025). [https://cloud.google.com/vertex-ai/generative-ai/docs/models/gemini/2-5-pro](https://cloud.google.com/vertex-ai/generative-ai/docs/models/gemini/2-5-pro)
136
+ - <span id="ref-openai-2025"></span> **OpenAI.** GPT-5. (2025). [https://platform.openai.com/docs/models/gpt-5](https://platform.openai.com/docs/models/gpt-5)
137
+ - <span id="ref-anthropic-2024"></span> **Anthropic.** Claude 3.5 Sonnet (2024). [https://www.anthropic.com/news/claude-3-5-sonnet](https://www.anthropic.com/news/claude-3-5-sonnet)
138
+ - <span id="ref-alibaba-2025"></span> **Alibaba.** Qwen3. (2025). [https://qwenlm.github.io/blog/qwen3/](https://qwenlm.github.io/blog/qwen3/)
139
+ - <span id="ref-openai-et-al-2025"></span> **OpenAI et al.** gpt-oss-120b & gpt-oss-20b Model Card (2025). [arXiv:2508.10925](https://arxiv.org/abs/2508.10925)
 
SETUP.md ADDED
@@ -0,0 +1,448 @@
1
+ # Large Language Model Analysis Framework for High Energy Physics
2
+
3
+ A framework for testing and evaluating Large Language Models (LLMs) on ATLAS H→γγ analysis tasks using a supervisor-coder architecture.
4
+
5
+ ## Table of Contents
6
+ - [Setup](#setup)
7
+ - [Data and Solution](#data-and-solution)
8
+ - [Running Tests](#running-tests)
9
+ - [Analysis and Visualization](#analysis-and-visualization)
10
+ - [Project Structure](#project-structure)
11
+ - [Advanced Usage](#advanced-usage)
12
+
13
+ ---
14
+
15
+ ## Setup
16
+
17
+ ### Prerequisites
18
+
19
+ **CBORG API Access Required**
20
+
21
+ This framework uses Lawrence Berkeley National Laboratory's CBORG API to access various LLMs. To use this code, you will need:
22
+
23
+ 1. Access to the CBORG API (contact LBL for access)
24
+ 2. A CBORG API key
25
+ 3. Network access to the CBORG API endpoint
26
+
27
+ **Note for External Users:** CBORG is an internal LBL system. External users may need to:
28
+ - Request guest access through LBL collaborations
29
+ - Adapt the code to use OpenAI API directly (requires code modifications)
30
+ - Contact the repository maintainers for alternative deployment options
31
+
32
+ ### Environment Setup
33
+ Create Conda environment:
34
+ ```bash
35
+ mamba env create -f environment.yml
36
+ conda activate llm_env
37
+ ```
38
+
39
+ ### API Configuration
40
+ Create script `~/.apikeys.sh` to export CBORG API key:
41
+ ```bash
42
+ export CBORG_API_KEY="INSERT_API_KEY"
43
+ ```
44
+
45
+ Then source it before running tests:
46
+ ```bash
47
+ source ~/.apikeys.sh
48
+ ```
49
+
50
+ ### Initial Configuration
51
+
52
+ Before running tests, set up your configuration files:
53
+
54
+ ```bash
55
+ # Copy example configuration files
56
+ cp config.example.yml config.yml
57
+ cp models.example.txt models.txt
58
+
59
+ # Edit config.yml to set your preferred models and parameters
60
+ # Edit models.txt to list models you want to test
61
+ ```
62
+
63
+ **Important:** The `models.txt` file must end with a blank line.
64
+
65
+ ---
66
+
67
+ ## Data and Solution
68
+
69
+ ### ATLAS Open Data Samples
70
+ All four data samples and Monte Carlo Higgs→γγ samples (including ttH) from the 2020 ATLAS Open Data diphoton campaign are available at:
71
+ ```
72
+ /global/cfs/projectdirs/atlas/eligd/llm_for_analysis_copy/data/
73
+ ```
74
+
75
+ **Important:** If copying data elsewhere, make the directory read-only to prevent LLM-generated code from modifying files:
76
+ ```bash
77
+ chmod -R a-w /path/to/data/directory
78
+ ```
79
+
80
+ ### Reference Solution
81
+ - Navigate to `solution/` directory and run `python soln.py`
82
+ - Use flags: `--step1`, `--step2`, `--step3`, `--plot` to control execution
83
+
84
+ ### Reference Arrays for Validation
85
+ Large `.npy` reference arrays are not committed to Git (see `.gitignore`).
86
+
87
+ **Quick fetch from repo root:**
88
+ ```bash
89
+ bash scripts/fetch_solution_arrays.sh
90
+ ```
91
+
92
+ **Or copy from NERSC shared path:**
93
+ ```
94
+ /global/cfs/projectdirs/atlas/dwkim/llm_test_dev_cxyang/llm_for_analysis/solution/arrays
95
+ ```
96
+
97
+ ---
98
+
99
+ ## Running Tests
100
+
101
+ ### Model Configuration
102
+
103
+ Three model list files control testing:
104
+ - **`models.txt`**: Models for sequential testing
105
+ - **`models_supervisor.txt`**: Supervisor models for paired testing
106
+ - **`models_coder.txt`**: Coder models for paired testing
107
+
108
+ **Important formatting rules:**
109
+ - One model per line
110
+ - File must end with a blank line
111
+ - Repeat model names for multiple trials
112
+ - Use CBORG aliases (e.g., `anthropic/claude-sonnet:latest`)
113
+
114
+ See `CBORG_MODEL_MAPPINGS.md` for available models and their actual versions.
115
+
116
+ ### Testing Workflows
117
+
118
+ #### 1. Sequential Testing (Single Model at a Time)
119
+ ```bash
120
+ bash test_models.sh output_dir_name
121
+ ```
122
+ Tests all models in `models.txt` sequentially.
123
+
124
+ #### 2. Parallel Testing (Multiple Models)
125
+ ```bash
126
+ # Basic parallel execution
127
+ bash test_models_parallel.sh output_dir_name
128
+
129
+ # GNU Parallel (recommended for large-scale testing)
130
+ bash test_models_parallel_gnu.sh output_dir_name [max_models] [tasks_per_model]
131
+
132
+ # Examples:
133
+ bash test_models_parallel_gnu.sh experiment1 # Default: 5 models, 5 tasks each
134
+ bash test_models_parallel_gnu.sh test 3 5 # 3 models, 5 tasks per model
135
+ bash test_models_parallel_gnu.sh large_test 10 5 # 10 models, 5 tasks each
136
+ ```
137
+
138
+ **GNU Parallel features:**
139
+ - Scales to 20-30 models with 200-300 total parallel jobs
140
+ - Automatic resource management
141
+ - Fast I/O using `/dev/shm` temporary workspace
142
+ - Comprehensive error handling and logging
143
+
144
+ #### 3. Step-by-Step Testing with Validation
145
+ ```bash
146
+ # Run all 5 steps with validation
147
+ ./run_smk_sequential.sh --validate
148
+
149
+ # Run specific steps
150
+ ./run_smk_sequential.sh --step2 --step3 --validate --job-id 002
151
+
152
+ # Run individual steps
153
+ ./run_smk_sequential.sh --step1 --validate # Step 1: Summarize ROOT
154
+ ./run_smk_sequential.sh --step2 --validate # Step 2: Create NumPy arrays
155
+ ./run_smk_sequential.sh --step3 --validate # Step 3: Preprocess
156
+ ./run_smk_sequential.sh --step4 --validate # Step 4: Compute scores
157
+ ./run_smk_sequential.sh --step5 --validate # Step 5: Categorization
158
+
159
+ # Custom output directory
160
+ ./run_smk_sequential.sh --step1 --validate --auto-dir # Creates timestamped dir
161
+ ```
162
+
163
+ **Directory naming options:**
164
+ - `--job-id ID`: Creates `results_job_ID/`
165
+ - `--auto-dir`: Creates `results_YYYYMMDD_HHMMSS/`
166
+ - `--out-dir DIR`: Custom directory name
167
+
168
+ ### Validation
169
+
170
+ **Automatic validation (during execution):**
171
+ ```bash
172
+ ./run_smk_sequential.sh --step1 --step2 --validate
173
+ ```
174
+ Validation logs saved to `{output_dir}/logs/*_validation.log`
175
+
176
+ **Manual validation (after execution):**
177
+ ```bash
178
+ # Validate all steps
179
+ python check_soln.py --out_dir results_job_002
180
+
181
+ # Validate specific step
182
+ python check_soln.py --out_dir results_job_002 --step 2
183
+ ```
184
+
185
+ **Validation features:**
186
+ - ✅ Adaptive tolerance with 4-significant-digit precision (see the sketch below)
187
+ - 📊 Column-by-column difference analysis
188
+ - 📋 Side-by-side value comparison
189
+ - 🎯 Clear, actionable error messages
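
+ The adaptive-tolerance idea can be illustrated as follows (a sketch only, not the actual `check_soln.py` implementation):

+ ```python
+ import numpy as np
+ 
+ def matches_to_four_sig_figs(generated, reference):
+     # A relative tolerance of 5e-4 corresponds to agreement in roughly the
+     # first four significant figures of each value.
+     return np.allclose(generated, reference, rtol=5e-4, atol=0.0)
+ ```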
190
+
191
+ ### Speed Optimization
192
+
193
+ Reduce iteration counts in `config.yml`:
194
+ ```yaml
195
+ # Limit LLM coder attempts (default 10)
196
+ max_iterations: 3
197
+ ```
198
+
199
+ ---
200
+
201
+ ## Analysis and Visualization
202
+
203
+ ### Results Summary
204
+ All test results are aggregated in:
205
+ ```
206
+ results_summary.csv
207
+ ```
208
+
209
+ **Columns include:** supervisor, coder, step, success, iterations, duration, API_calls, tokens, errors, error_descriptions
210
+
211
+ ### Error Analysis and Categorization
212
+
213
+ **Automated error analysis:**
214
+ ```bash
215
+ python error_analysis.py --results_dirs <dir1> <dir2> ... --output results_summary.csv --model <model_name>
216
+ ```
217
+
218
+ Uses LLM to analyze comprehensive logs and categorize errors into:
219
+ - Semantic errors
220
+ - Function-calling errors
221
+ - Intermediate file not found
222
+ - Incorrect branch name
223
+ - OpenAI API errors
224
+ - Data quality issues (all weights = 0)
225
+ - Other/uncategorized
226
+
227
+ ### Interactive Analysis Notebooks
228
+
229
+ #### 1. Five-Step Performance Analysis (`five_step_analysis.ipynb`)
230
+ Comprehensive analysis of model performance across all 5 workflow steps:
231
+ - **Success rate heatmap** (models × steps)
232
+ - **Agent work progression** (iterations over steps)
233
+ - **API call statistics** (by step and model)
234
+ - **Cost analysis** (input/output tokens, estimated pricing)
235
+
236
+ **Output plots:**
237
+ - `plots/1_success_rate_heatmap.pdf`
238
+ - `plots/2_agent_work_line_plot.pdf`
239
+ - `plots/3_api_calls_line_plot.pdf`
240
+ - `plots/4_cost_per_step.pdf`
241
+ - `plots/five_step_summary_stats.csv`
242
+
243
+ #### 2. Error Category Analysis (`error_analysis.ipynb`)
244
+ Deep dive into error patterns and failure modes:
245
+ - **Normalized error distribution** (stacked bar chart with percentages)
246
+ - **Error type heatmap** (models × error categories)
247
+ - **Top model breakdowns** (faceted plots for top 9 models)
248
+ - **Error trends across steps** (stacked area chart)
249
+
250
+ **Output plots:**
251
+ - `plots/error_distribution_by_model.pdf`
252
+ - `plots/error_heatmap_by_model.pdf`
253
+ - `plots/error_categories_top_models.pdf`
254
+ - `plots/errors_by_step.pdf`
255
+
256
+ #### 3. Quick Statistics (`plot_stats.ipynb`)
257
+ Legacy notebook for basic statistics visualization.
258
+
259
+ ### Log Interpretation
260
+
261
+ **Automated log analysis:**
262
+ ```bash
263
+ python logs_interpreter.py --log_dir <output_dir> --model lbl/cborg-deepthought --output analysis.txt
264
+ ```
265
+
266
+ Analyzes comprehensive supervisor-coder logs to identify:
267
+ - Root causes of failures
268
+ - Responsible parties (user, supervisor, coder, external)
269
+ - Error patterns across iterations
270
+
271
+ ---
272
+
273
+ ## Project Structure
274
+
275
+ ### Core Scripts
276
+ - **`supervisor_coder.py`**: Supervisor-coder framework implementation
277
+ - **`check_soln.py`**: Solution validation with enhanced comparison
278
+ - **`write_prompt.py`**: Prompt management and context chaining
279
+ - **`update_stats.py`**: Statistics tracking and CSV updates
280
+ - **`error_analysis.py`**: LLM-powered error categorization
281
+
282
+ ### Test Runners
283
+ - **`test_models.sh`**: Sequential model testing
284
+ - **`test_models_parallel.sh`**: Parallel testing (basic)
285
+ - **`test_models_parallel_gnu.sh`**: GNU Parallel testing (recommended)
286
+ - **`test_stats.sh`**: Individual model statistics
287
+ - **`test_stats_parallel.sh`**: Parallel step execution
288
+ - **`run_smk_sequential.sh`**: Step-by-step workflow runner
289
+
290
+ ### Snakemake Workflows (`workflow/`)
291
+ The analysis workflow is divided into 5 sequential steps:
292
+
293
+ 1. **`summarize_root.smk`**: Extract ROOT file structure and branch information
294
+ 2. **`create_numpy.smk`**: Convert ROOT → NumPy arrays
295
+ 3. **`preprocess.smk`**: Apply preprocessing transformations
296
+ 4. **`scores.smk`**: Compute signal/background classification scores
297
+ 5. **`categorization.smk`**: Final categorization and statistical analysis
298
+
299
+ **Note:** Later steps use solution outputs to enable testing even when earlier steps fail.
300
+
301
+ ### Prompts (`prompts/`)
302
+ - `summarize_root.txt`: Step 1 task description
303
+ - `create_numpy.txt`: Step 2 task description
304
+ - `preprocess.txt`: Step 3 task description
305
+ - `scores.txt`: Step 4 task description
306
+ - `categorization.txt`: Step 5 task description
307
+ - `supervisor_first_call.txt`: Initial supervisor instructions
308
+ - `supervisor_call.txt`: Subsequent supervisor instructions
309
+
310
+ ### Utility Scripts (`util/`)
311
+ - **`inspect_root.py`**: ROOT file inspection tools
312
+ - **`analyze_particles.py`**: Particle-level analysis
313
+ - **`compare_arrays.py`**: NumPy array comparison utilities
314
+
315
+ ### Model Documentation
316
+ - **`CBORG_MODEL_MAPPINGS.md`**: CBORG alias → actual model mappings
317
+ - **`COMPLETE_MODEL_VERSIONS.md`**: Full version information for all tested models
318
+ - **`MODEL_NAME_UPDATES.md`**: Model name standardization notes
319
+ - **`O3_MODEL_COMPARISON.md`**: OpenAI O3 model variant comparison
320
+
321
+ ### Analysis Notebooks
322
+ - **`five_step_analysis.ipynb`**: Comprehensive 5-step performance analysis
323
+ - **`error_analysis.ipynb`**: Error categorization and pattern analysis
324
+ - **`error_analysis_plotting.ipynb`**: Additional error visualizations
325
+ - **`plot_stats.ipynb`**: Legacy statistics plots
326
+
327
+ ### Output Structure
328
+ Each test run creates:
329
+ ```
330
+ output_name/
331
+ ├── model_timestamp/
332
+ │   ├── generated_code/   # LLM-generated Python scripts
333
+ │   ├── logs/             # Execution logs and supervisor records
334
+ │   ├── arrays/           # NumPy arrays produced by generated code
335
+ │   ├── plots/            # Comparison plots (generated vs. solution)
336
+ │   ├── prompt_pairs/     # User + supervisor prompts
337
+ │   ├── results/          # Temporary ROOT files (job-scoped)
338
+ │   └── snakemake_log/    # Snakemake execution logs
339
+ ```
340
+
341
+ **Job-scoped ROOT outputs:**
342
+ - Step 5 uses temporary ROOT files (`signal.root`, `bkgd.root`)
343
+ - Written to `${OUTPUT_DIR}/results/` to prevent cross-run interference
344
+ - Automatically cleaned after significance calculation
345
+
346
+ ---
347
+
348
+ ## Advanced Usage
349
+
350
+ ### Supervisor-Coder Configuration
351
+
352
+ Control iteration limits in `config.yml`:
353
+ ```yaml
354
+ model: 'anthropic/claude-sonnet:latest'
355
+ name: 'experiment_name'
356
+ out_dir: 'results/experiment_name'
357
+ max_iterations: 10 # Maximum supervisor-coder iterations per step
358
+ ```
359
+
360
+ ### Parallel Execution Tuning
361
+
362
+ For `test_models_parallel_gnu.sh`:
363
+ ```bash
364
+ # Syntax:
365
+ bash test_models_parallel_gnu.sh <output> <max_models> <tasks_per_model>
366
+
367
+ # Conservative (safe for shared systems):
368
+ bash test_models_parallel_gnu.sh test 3 5 # 15 total jobs
369
+
370
+ # Aggressive (dedicated nodes):
371
+ bash test_models_parallel_gnu.sh test 10 10 # 100 total jobs
372
+ ```
373
+
374
+ ### Custom Validation
375
+
376
+ Run validation on specific steps or with custom tolerances:
377
+ ```bash
378
+ # Validate only data conversion step
379
+ python check_soln.py --out_dir results/ --step 2
380
+
381
+ # Check multiple specific steps
382
+ python check_soln.py --out_dir results/ --step 2 --step 3 --step 4
383
+ ```
384
+
385
+ ### Log Analysis Pipeline
386
+
387
+ ```bash
388
+ # 1. Run tests
389
+ bash test_models_parallel_gnu.sh experiment1 5 5
390
+
391
+ # 2. Analyze logs with LLM
392
+ python logs_interpreter.py --log_dir experiment1/model_timestamp/ --output analysis.txt
393
+
394
+ # 3. Categorize errors
395
+ python error_analysis.py --results_dirs experiment1/*/ --output summary.csv
396
+
397
+ # 4. Generate visualizations
398
+ jupyter notebook error_analysis.ipynb
399
+ ```
400
+
401
+ ---
402
+
403
+ ## Roadmap and Future Directions
404
+
405
+ ### Planned Improvements
406
+
407
+ **Prompt Engineering:**
408
+ - Auto-load context (file lists, logs) at step start
409
+ - Provide comprehensive inputs/outputs/summaries upfront
410
+ - Develop prompt-management layer for cross-analysis reuse
411
+
412
+ **Validation & Monitoring:**
413
+ - Embed validation in workflows for immediate error detection
414
+ - Record input/output and state transitions for reproducibility
415
+ - Enhanced situation awareness through comprehensive logging
416
+
417
+ **Multi-Analysis Extension:**
418
+ - Rerun H→γγ with improved system prompts
419
+ - Extend to H→4ℓ and other Higgs+X channels
420
+ - Provide learned materials from previous analyses as reference
421
+
422
+ **Self-Improvement:**
423
+ - Reinforcement learning–style feedback loops
424
+ - Agent-driven prompt refinement
425
+ - Automatic generalization across HEP analyses
426
+
427
+ ---
428
+
429
+ ## Citation and Acknowledgments
430
+
431
+ This framework tests LLM agents on ATLAS Open Data from:
432
+ - 2020 ATLAS Open Data diphoton samples: https://opendata.cern.ch/record/15006
433
+
434
+ Models tested via CBORG API (Lawrence Berkeley National Laboratory).
435
+
436
+ ---
437
+
438
+ ## Support and Contributing
439
+
440
+ For questions or issues:
441
+ 1. Check existing documentation in `*.md` files
442
+ 2. Review example configurations in `config.yml`
443
+ 3. Examine validation logs in output directories
444
+
445
+ For contributions, please ensure:
446
+ - Model lists end with blank lines
447
+ - Prompts follow established format
448
+ - Validation passes for all test cases