
Large Language Model Analysis Framework for High Energy Physics

A framework for testing and evaluating Large Language Models (LLMs) on ATLAS H→γγ analysis tasks using a supervisor-coder architecture.

Setup

Prerequisites

CBORG API Access Required

This framework uses Lawrence Berkeley National Laboratory's CBORG API to access a variety of LLMs. To use this code, you will need:

  1. Access to the CBORG API (contact LBL for access)
  2. A CBORG API key
  3. Network access to the CBORG API endpoint

Note for External Users: CBORG is an internal LBL system. External users may need to:

  • Request guest access through LBL collaborations
  • Adapt the code to use the OpenAI API directly (requires code modifications; see the sketch after this list)
  • Contact the repository maintainers for alternative deployment options
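
If you adapt the code to another endpoint, the relevant change is usually how the LLM client is constructed. The sketch below assumes an OpenAI-compatible client (suggested by the "OpenAI API errors" category later in this document); the base-URL environment variable and model alias are placeholders, not the framework's actual configuration.

import os
from openai import OpenAI

# Point an OpenAI-compatible client at either CBORG or OpenAI.
# LLM_BASE_URL is a placeholder; leave it unset to use api.openai.com.
client = OpenAI(
    api_key=os.environ["CBORG_API_KEY"],      # or OPENAI_API_KEY for direct OpenAI access
    base_url=os.environ.get("LLM_BASE_URL"),  # e.g. the CBORG endpoint URL
)

response = client.chat.completions.create(
    model="anthropic/claude-sonnet:latest",   # CBORG alias; use an OpenAI model name when going direct
    messages=[{"role": "user", "content": "Summarize the branches in this ROOT file."}],
)
print(response.choices[0].message.content)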

Environment Setup

Create Conda environment:

mamba env create -f environment.yml
conda activate llm_env

API Configuration

Create a script ~/.apikeys.sh that exports your CBORG API key:

export CBORG_API_KEY="INSERT_API_KEY"

Then source it before running tests:

source ~/.apikeys.sh

Initial Configuration

Before running tests, set up your configuration files:

# Copy example configuration files
cp config.example.yml config.yml
cp models.example.txt models.txt

# Edit config.yml to set your preferred models and parameters
# Edit models.txt to list models you want to test

Important: The models.txt file must end with a blank line.


Data and Solution

ATLAS Open Data Samples

All four data samples and Monte Carlo Higgs→γγ samples (including ttH) from the 2020 ATLAS Open Data diphoton campaign are available at:

/global/cfs/projectdirs/atlas/eligd/llm_for_analysis_copy/data/

Important: If copying data elsewhere, make the directory read-only to prevent LLM-generated code from modifying files:

chmod -R a-w /path/to/data/directory

Reference Solution

  • Navigate to the solution/ directory and run python soln.py
  • Use the flags --step1, --step2, --step3, and --plot to control which parts of the solution run

Reference Arrays for Validation

Large .npy reference arrays are not committed to Git (see .gitignore).

Quick fetch from repo root:

bash scripts/fetch_solution_arrays.sh

Or copy from NERSC shared path:

/global/cfs/projectdirs/atlas/dwkim/llm_test_dev_cxyang/llm_for_analysis/solution/arrays

Running Tests

Model Configuration

Three model list files control testing:

  • models.txt: Models for sequential testing
  • models_supervisor.txt: Supervisor models for paired testing
  • models_coder.txt: Coder models for paired testing

Important formatting rules:

  • One model per line
  • File must end with a blank line
  • Repeat model names for multiple trials
  • Use CBORG aliases (e.g., anthropic/claude-sonnet:latest)

See CBORG_MODEL_MAPPINGS.md for available models and their actual versions.
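
For example, a models.txt that runs two trials of one model and one trial of another might look like this (the aliases are taken from examples elsewhere in this document; note the required trailing blank line):

anthropic/claude-sonnet:latest
anthropic/claude-sonnet:latest
lbl/cborg-deepthought
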

Testing Workflows

1. Sequential Testing (Single Model at a Time)

bash test_models.sh output_dir_name

Tests all models in models.txt sequentially.

2. Parallel Testing (Multiple Models)

# Basic parallel execution
bash test_models_parallel.sh output_dir_name

# GNU Parallel (recommended for large-scale testing)
bash test_models_parallel_gnu.sh output_dir_name [max_models] [tasks_per_model]

# Examples:
bash test_models_parallel_gnu.sh experiment1        # Default: 5 models, 5 tasks each
bash test_models_parallel_gnu.sh test 3 5           # 3 models, 5 tasks per model
bash test_models_parallel_gnu.sh large_test 10 5    # 10 models, 5 tasks each

GNU Parallel features:

  • Scales to 20-30 models with 200-300 total parallel jobs
  • Automatic resource management
  • Fast I/O using /dev/shm temporary workspace
  • Comprehensive error handling and logging

3. Step-by-Step Testing with Validation

# Run all 5 steps with validation
./run_smk_sequential.sh --validate

# Run specific steps
./run_smk_sequential.sh --step2 --step3 --validate --job-id 002

# Run individual steps
./run_smk_sequential.sh --step1 --validate  # Step 1: Summarize ROOT
./run_smk_sequential.sh --step2 --validate  # Step 2: Create NumPy arrays
./run_smk_sequential.sh --step3 --validate  # Step 3: Preprocess
./run_smk_sequential.sh --step4 --validate  # Step 4: Compute scores
./run_smk_sequential.sh --step5 --validate  # Step 5: Categorization

# Custom output directory
./run_smk_sequential.sh --step1 --validate --auto-dir  # Creates timestamped dir

Directory naming options:

  • --job-id ID: Creates results_job_ID/
  • --auto-dir: Creates results_YYYYMMDD_HHMMSS/
  • --out-dir DIR: Custom directory name

Validation

Automatic validation (during execution):

./run_smk_sequential.sh --step1 --step2 --validate

Validation logs are saved to {output_dir}/logs/*_validation.log

Manual validation (after execution):

# Validate all steps
python check_soln.py --out_dir results_job_002

# Validate specific step
python check_soln.py --out_dir results_job_002 --step 2

Validation features:

  • ✅ Adaptive tolerance with 4-significant-digit precision (see the sketch after this list)
  • 📊 Column-by-column difference analysis
  • 📋 Side-by-side value comparison
  • 🎯 Clear, actionable error messages
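
The tolerance check can be illustrated with NumPy. This is a sketch of the idea only, not check_soln.py's exact implementation, and the file paths are hypothetical:

import numpy as np

# Compare a generated array against the reference array.
generated = np.load("results_job_002/arrays/step2_output.npy")  # hypothetical path
reference = np.load("solution/arrays/step2_output.npy")         # hypothetical path

# A relative tolerance of 1e-4 corresponds roughly to agreement in the
# first four significant digits.
matches = np.isclose(generated, reference, rtol=1e-4, atol=0.0)
print(f"{matches.mean():.1%} of entries agree to ~4 significant digits")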

Speed Optimization

Reduce iteration counts in config.yml:

# Limit LLM coder attempts (default 10)
max_iterations: 3

Analysis and Visualization

Results Summary

All test results are aggregated in:

results_summary.csv

Columns include: supervisor, coder, step, success, iterations, duration, API_calls, tokens, errors, error_descriptions
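
A quick way to inspect the summary, assuming the column names above and a 0/1 (or boolean) success column, is a short pandas snippet such as:

import pandas as pd

# Per-coder success rate for each workflow step.
df = pd.read_csv("results_summary.csv")
success_by_step = (
    df.groupby(["coder", "step"])["success"]
      .mean()              # fraction of successful runs
      .unstack("step")     # one column per step
)
print(success_by_step.round(2))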

Error Analysis and Categorization

Automated error analysis:

python error_analysis.py --results_dirs <dir1> <dir2> ... --output results_summary.csv --model <model_name>

Uses an LLM to analyze the comprehensive logs and categorize errors into:

  • Semantic errors
  • Function-calling errors
  • Intermediate file not found
  • Incorrect branch name
  • OpenAI API errors
  • Data quality issues (all weights = 0)
  • Other/uncategorized

Interactive Analysis Notebooks

1. Five-Step Performance Analysis (five_step_analysis.ipynb)

Comprehensive analysis of model performance across all 5 workflow steps:

  • Success rate heatmap (models × steps)
  • Agent work progression (iterations over steps)
  • API call statistics (by step and model)
  • Cost analysis (input/output tokens, estimated pricing)

Output plots:

  • plots/1_success_rate_heatmap.pdf
  • plots/2_agent_work_line_plot.pdf
  • plots/3_api_calls_line_plot.pdf
  • plots/4_cost_per_step.pdf
  • plots/five_step_summary_stats.csv

2. Error Category Analysis (error_analysis.ipynb)

Deep dive into error patterns and failure modes:

  • Normalized error distribution (stacked bar chart with percentages)
  • Error type heatmap (models × error categories)
  • Top model breakdowns (faceted plots for top 9 models)
  • Error trends across steps (stacked area chart)

Output plots:

  • plots/error_distribution_by_model.pdf
  • plots/error_heatmap_by_model.pdf
  • plots/error_categories_top_models.pdf
  • plots/errors_by_step.pdf

3. Quick Statistics (plot_stats.ipynb)

Legacy notebook for basic statistics visualization.

Log Interpretation

Automated log analysis:

python logs_interpreter.py --log_dir <output_dir> --model lbl/cborg-deepthought --output analysis.txt

Analyzes comprehensive supervisor-coder logs to identify:

  • Root causes of failures
  • Responsible parties (user, supervisor, coder, external)
  • Error patterns across iterations

Project Structure

Core Scripts

  • supervisor_coder.py: Supervisor-coder framework implementation
  • check_soln.py: Solution validation with enhanced comparison
  • write_prompt.py: Prompt management and context chaining
  • update_stats.py: Statistics tracking and CSV updates
  • error_analysis.py: LLM-powered error categorization

Test Runners

  • test_models.sh: Sequential model testing
  • test_models_parallel.sh: Parallel testing (basic)
  • test_models_parallel_gnu.sh: GNU Parallel testing (recommended)
  • test_stats.sh: Individual model statistics
  • test_stats_parallel.sh: Parallel step execution
  • run_smk_sequential.sh: Step-by-step workflow runner

Snakemake Workflows (workflow/)

The analysis workflow is divided into 5 sequential steps:

  1. summarize_root.smk: Extract ROOT file structure and branch information
  2. create_numpy.smk: Convert ROOT → NumPy arrays
  3. preprocess.smk: Apply preprocessing transformations
  4. scores.smk: Compute signal/background classification scores
  5. categorization.smk: Final categorization and statistical analysis

Note: Later steps take the reference solution's outputs as input, so each step can be tested even when earlier steps fail.

Prompts (prompts/)

  • summarize_root.txt: Step 1 task description
  • create_numpy.txt: Step 2 task description
  • preprocess.txt: Step 3 task description
  • scores.txt: Step 4 task description
  • categorization.txt: Step 5 task description
  • supervisor_first_call.txt: Initial supervisor instructions
  • supervisor_call.txt: Subsequent supervisor instructions

Utility Scripts (util/)

  • inspect_root.py: ROOT file inspection tools
  • analyze_particles.py: Particle-level analysis
  • compare_arrays.py: NumPy array comparison utilities

Model Documentation

  • CBORG_MODEL_MAPPINGS.md: CBORG alias → actual model mappings
  • COMPLETE_MODEL_VERSIONS.md: Full version information for all tested models
  • MODEL_NAME_UPDATES.md: Model name standardization notes
  • O3_MODEL_COMPARISON.md: OpenAI O3 model variant comparison

Analysis Notebooks

  • five_step_analysis.ipynb: Comprehensive 5-step performance analysis
  • error_analysis.ipynb: Error categorization and pattern analysis
  • error_analysis_plotting.ipynb: Additional error visualizations
  • plot_stats.ipynb: Legacy statistics plots

Output Structure

Each test run creates:

output_name/
├── model_timestamp/
│   ├── generated_code/     # LLM-generated Python scripts
│   ├── logs/               # Execution logs and supervisor records
│   ├── arrays/             # NumPy arrays produced by generated code
│   ├── plots/              # Comparison plots (generated vs. solution)
│   ├── prompt_pairs/       # User + supervisor prompts
│   ├── results/            # Temporary ROOT files (job-scoped)
│   └── snakemake_log/      # Snakemake execution logs

Job-scoped ROOT outputs:

  • Step 5 uses temporary ROOT files (signal.root, bkgd.root)
  • Written to ${OUTPUT_DIR}/results/ to prevent cross-run interference
  • Automatically cleaned after significance calculation
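
Because every run follows this layout, per-run artifacts can be collected programmatically. A minimal sketch (the top-level directory name is hypothetical):

from pathlib import Path

# Gather validation logs across all runs under one output directory.
output_root = Path("experiment1")  # hypothetical output_name
for log in sorted(output_root.glob("*/logs/*_validation.log")):
    run_dir = log.parts[1]         # the model_timestamp directory
    print(run_dir, log.name)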

Advanced Usage

Supervisor-Coder Configuration

Control iteration limits in config.yml:

model: 'anthropic/claude-sonnet:latest'
name: 'experiment_name'
out_dir: 'results/experiment_name'
max_iterations: 10  # Maximum supervisor-coder iterations per step
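
For reference, these settings can be read with PyYAML; this is an illustration only, not the framework's actual loading code:

import yaml

# Load config.yml and pull out the fields shown above.
with open("config.yml") as f:
    cfg = yaml.safe_load(f)

print(cfg["model"], cfg["out_dir"], cfg["max_iterations"])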

Parallel Execution Tuning

For test_models_parallel_gnu.sh:

# Syntax:
bash test_models_parallel_gnu.sh <output> <max_models> <tasks_per_model>

# Conservative (safe for shared systems):
bash test_models_parallel_gnu.sh test 3 5    # 15 total jobs

# Aggressive (dedicated nodes):
bash test_models_parallel_gnu.sh test 10 10  # 100 total jobs

Custom Validation

Run validation on specific steps or with custom tolerances:

# Validate only data conversion step
python check_soln.py --out_dir results/ --step 2

# Check multiple specific steps
python check_soln.py --out_dir results/ --step 2 --step 3 --step 4

Log Analysis Pipeline

# 1. Run tests
bash test_models_parallel_gnu.sh experiment1 5 5

# 2. Analyze logs with LLM
python logs_interpreter.py --log_dir experiment1/model_timestamp/ --output analysis.txt

# 3. Categorize errors
python error_analysis.py --results_dirs experiment1/*/ --output summary.csv

# 4. Generate visualizations
jupyter notebook error_analysis.ipynb

Roadmap and Future Directions

Planned Improvements

Prompt Engineering:

  • Auto-load context (file lists, logs) at step start
  • Provide comprehensive inputs/outputs/summaries upfront
  • Develop prompt-management layer for cross-analysis reuse

Validation & Monitoring:

  • Embed validation in workflows for immediate error detection
  • Record input/output and state transitions for reproducibility
  • Enhanced situation awareness through comprehensive logging

Multi-Analysis Extension:

  • Rerun H→γγ with improved system prompts
  • Extend to H→4ℓ and other Higgs+X channels
  • Provide learned materials from previous analyses as reference

Self-Improvement:

  • Reinforcement learning–style feedback loops
  • Agent-driven prompt refinement
  • Automatic generalization across HEP analyses

Citation and Acknowledgments

This framework tests LLM agents on samples from the 2020 ATLAS Open Data release.

Models are accessed through the CBORG API (Lawrence Berkeley National Laboratory).


Support and Contributing

For questions or issues:

  1. Check existing documentation in *.md files
  2. Review example configurations in config.yml
  3. Examine validation logs in output directories

For contributions, please ensure:

  • Model list files end with a blank line
  • Prompts follow the established format
  • Validation passes for all test cases