
Large Language Model Analysis Framework for High Energy Physics

A framework for testing and evaluating Large Language Models (LLMs) on ATLAS H→γγ analysis tasks using a supervisor-coder architecture.

Setup

Prerequisites

CBORG API Access Required

This framework uses Lawrence Berkeley National Laboratory's CBORG API to access a variety of LLMs. To use this code, you will need:

  1. Access to the CBORG API (contact LBL for access)
  2. A CBORG API key
  3. Network access to the CBORG API endpoint

Note for External Users: CBORG is an internal LBL system. External users may need to:

  • Request guest access through LBL collaborations
  • Adapt the code to use the OpenAI API directly (requires code modifications; see the sketch after this list)
  • Contact the repository maintainers for alternative deployment options
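
If you adapt the code to another endpoint, the relevant change is usually how the LLM client is constructed. The sketch below assumes an OpenAI-compatible client (suggested by the "OpenAI API errors" category later in this document); the base-URL environment variable and model alias are placeholders, not the framework's actual configuration.

import os
from openai import OpenAI

# Point an OpenAI-compatible client at either CBORG or OpenAI.
# LLM_BASE_URL is a placeholder; leave it unset to use api.openai.com.
client = OpenAI(
    api_key=os.environ["CBORG_API_KEY"],      # or OPENAI_API_KEY for direct OpenAI access
    base_url=os.environ.get("LLM_BASE_URL"),  # e.g. the CBORG endpoint URL
)

response = client.chat.completions.create(
    model="anthropic/claude-sonnet:latest",   # CBORG alias; use an OpenAI model name when going direct
    messages=[{"role": "user", "content": "Summarize the branches in this ROOT file."}],
)
print(response.choices[0].message.content)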

Environment Setup

Create Conda environment:

mamba env create -f environment.yml
conda activate llm_env

API Configuration

Create a script ~/.apikeys.sh that exports your CBORG API key:

export CBORG_API_KEY="INSERT_API_KEY"

Then source it before running tests:

source ~/.apikeys.sh

Initial Configuration

Before running tests, set up your configuration files:

# Copy example configuration files
cp config.example.yml config.yml
cp models.example.txt models.txt

# Edit config.yml to set your preferred models and parameters
# Edit models.txt to list models you want to test

Important: The models.txt file must end with a blank line.


Data and Solution

ATLAS Open Data Samples

All four data samples and Monte Carlo Higgs→γγ samples (including ttH) from the 2020 ATLAS Open Data diphoton campaign are available at:

/global/cfs/projectdirs/atlas/eligd/llm_for_analysis_copy/data/

Important: If copying data elsewhere, make the directory read-only to prevent LLM-generated code from modifying files:

chmod -R a-w /path/to/data/directory

Reference Solution

  • Navigate to the solution/ directory and run python soln.py
  • Use the flags --step1, --step2, --step3, and --plot to control which parts of the solution run

Reference Arrays for Validation

Large .npy reference arrays are not committed to Git (see .gitignore).

Quick fetch from repo root:

bash scripts/fetch_solution_arrays.sh

Or copy from NERSC shared path:

/global/cfs/projectdirs/atlas/dwkim/llm_test_dev_cxyang/llm_for_analysis/solution/arrays

Running Tests

Model Configuration

Three model list files control testing:

  • models.txt: Models for sequential testing
  • models_supervisor.txt: Supervisor models for paired testing
  • models_coder.txt: Coder models for paired testing

Important formatting rules:

  • One model per line
  • File must end with a blank line
  • Repeat model names for multiple trials
  • Use CBORG aliases (e.g., anthropic/claude-sonnet:latest)

See CBORG_MODEL_MAPPINGS.md for available models and their actual versions.
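
For example, a models.txt that runs two trials of one model and one trial of another might look like this (the aliases are taken from examples elsewhere in this document; note the required trailing blank line):

anthropic/claude-sonnet:latest
anthropic/claude-sonnet:latest
lbl/cborg-deepthought
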

Testing Workflows

1. Sequential Testing (Single Model at a Time)

bash test_models.sh output_dir_name

Tests all models in models.txt sequentially.

2. Parallel Testing (Multiple Models)

# Basic parallel execution
bash test_models_parallel.sh output_dir_name

# GNU Parallel (recommended for large-scale testing)
bash test_models_parallel_gnu.sh output_dir_name [max_models] [tasks_per_model]

# Examples:
bash test_models_parallel_gnu.sh experiment1        # Default: 5 models, 5 tasks each
bash test_models_parallel_gnu.sh test 3 5           # 3 models, 5 tasks per model
bash test_models_parallel_gnu.sh large_test 10 5    # 10 models, 5 tasks each

GNU Parallel features:

  • Scales to 20-30 models with 200-300 total parallel jobs
  • Automatic resource management
  • Fast I/O using /dev/shm temporary workspace
  • Comprehensive error handling and logging

3. Step-by-Step Testing with Validation

# Run all 5 steps with validation
./run_smk_sequential.sh --validate

# Run specific steps
./run_smk_sequential.sh --step2 --step3 --validate --job-id 002

# Run individual steps
./run_smk_sequential.sh --step1 --validate  # Step 1: Summarize ROOT
./run_smk_sequential.sh --step2 --validate  # Step 2: Create NumPy arrays
./run_smk_sequential.sh --step3 --validate  # Step 3: Preprocess
./run_smk_sequential.sh --step4 --validate  # Step 4: Compute scores
./run_smk_sequential.sh --step5 --validate  # Step 5: Categorization

# Custom output directory
./run_smk_sequential.sh --step1 --validate --auto-dir  # Creates timestamped dir

Directory naming options:

  • --job-id ID: Creates results_job_ID/
  • --auto-dir: Creates results_YYYYMMDD_HHMMSS/
  • --out-dir DIR: Custom directory name

Validation

Automatic validation (during execution):

./run_smk_sequential.sh --step1 --step2 --validate

Validation logs are saved to {output_dir}/logs/*_validation.log

Manual validation (after execution):

# Validate all steps
python check_soln.py --out_dir results_job_002

# Validate specific step
python check_soln.py --out_dir results_job_002 --step 2

Validation features:

  • ✅ Adaptive tolerance with 4-significant-digit precision (see the sketch after this list)
  • 📊 Column-by-column difference analysis
  • 📋 Side-by-side value comparison
  • 🎯 Clear, actionable error messages
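
The tolerance check can be illustrated with NumPy. This is a sketch of the idea only, not check_soln.py's exact implementation, and the file paths are hypothetical:

import numpy as np

# Compare a generated array against the reference array.
generated = np.load("results_job_002/arrays/step2_output.npy")  # hypothetical path
reference = np.load("solution/arrays/step2_output.npy")         # hypothetical path

# A relative tolerance of 1e-4 corresponds roughly to agreement in the
# first four significant digits.
matches = np.isclose(generated, reference, rtol=1e-4, atol=0.0)
print(f"{matches.mean():.1%} of entries agree to ~4 significant digits")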

Speed Optimization

Reduce iteration counts in config.yml:

# Limit LLM coder attempts (default 10)
max_iterations: 3

Analysis and Visualization

Results Summary

All test results are aggregated in:

results_summary.csv

Columns include: supervisor, coder, step, success, iterations, duration, API_calls, tokens, errors, error_descriptions
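
A quick way to inspect the summary, assuming the column names above and a 0/1 (or boolean) success column, is a short pandas snippet such as:

import pandas as pd

# Per-coder success rate for each workflow step.
df = pd.read_csv("results_summary.csv")
success_by_step = (
    df.groupby(["coder", "step"])["success"]
      .mean()              # fraction of successful runs
      .unstack("step")     # one column per step
)
print(success_by_step.round(2))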

Error Analysis and Categorization

Automated error analysis:

python error_analysis.py --results_dirs <dir1> <dir2> ... --output results_summary.csv --model <model_name>

Uses an LLM to analyze the comprehensive logs and categorize errors into:

  • Semantic errors
  • Function-calling errors
  • Intermediate file not found
  • Incorrect branch name
  • OpenAI API errors
  • Data quality issues (all weights = 0)
  • Other/uncategorized

Interactive Analysis Notebooks

1. Five-Step Performance Analysis (five_step_analysis.ipynb)

Comprehensive analysis of model performance across all 5 workflow steps:

  • Success rate heatmap (models × steps)
  • Agent work progression (iterations over steps)
  • API call statistics (by step and model)
  • Cost analysis (input/output tokens, estimated pricing)

Output plots:

  • plots/1_success_rate_heatmap.pdf
  • plots/2_agent_work_line_plot.pdf
  • plots/3_api_calls_line_plot.pdf
  • plots/4_cost_per_step.pdf
  • plots/five_step_summary_stats.csv

2. Error Category Analysis (error_analysis.ipynb)

Deep dive into error patterns and failure modes:

  • Normalized error distribution (stacked bar chart with percentages)
  • Error type heatmap (models × error categories)
  • Top model breakdowns (faceted plots for top 9 models)
  • Error trends across steps (stacked area chart)

Output plots:

  • plots/error_distribution_by_model.pdf
  • plots/error_heatmap_by_model.pdf
  • plots/error_categories_top_models.pdf
  • plots/errors_by_step.pdf

3. Quick Statistics (plot_stats.ipynb)

Legacy notebook for basic statistics visualization.

Log Interpretation

Automated log analysis:

python logs_interpreter.py --log_dir <output_dir> --model lbl/cborg-deepthought --output analysis.txt

Analyzes comprehensive supervisor-coder logs to identify:

  • Root causes of failures
  • Responsible parties (user, supervisor, coder, external)
  • Error patterns across iterations

Project Structure

Core Scripts

  • supervisor_coder.py: Supervisor-coder framework implementation
  • check_soln.py: Solution validation with enhanced comparison
  • write_prompt.py: Prompt management and context chaining
  • update_stats.py: Statistics tracking and CSV updates
  • error_analysis.py: LLM-powered error categorization

Test Runners

  • test_models.sh: Sequential model testing
  • test_models_parallel.sh: Parallel testing (basic)
  • test_models_parallel_gnu.sh: GNU Parallel testing (recommended)
  • test_stats.sh: Individual model statistics
  • test_stats_parallel.sh: Parallel step execution
  • run_smk_sequential.sh: Step-by-step workflow runner

Snakemake Workflows (workflow/)

The analysis workflow is divided into 5 sequential steps:

  1. summarize_root.smk: Extract ROOT file structure and branch information
  2. create_numpy.smk: Convert ROOT → NumPy arrays
  3. preprocess.smk: Apply preprocessing transformations
  4. scores.smk: Compute signal/background classification scores
  5. categorization.smk: Final categorization and statistical analysis

Note: Later steps take the reference solution's outputs as input, so each step can be tested even when earlier steps fail.

Prompts (prompts/)

  • summarize_root.txt: Step 1 task description
  • create_numpy.txt: Step 2 task description
  • preprocess.txt: Step 3 task description
  • scores.txt: Step 4 task description
  • categorization.txt: Step 5 task description
  • supervisor_first_call.txt: Initial supervisor instructions
  • supervisor_call.txt: Subsequent supervisor instructions

Utility Scripts (util/)

  • inspect_root.py: ROOT file inspection tools
  • analyze_particles.py: Particle-level analysis
  • compare_arrays.py: NumPy array comparison utilities

Model Documentation

  • CBORG_MODEL_MAPPINGS.md: CBORG alias → actual model mappings
  • COMPLETE_MODEL_VERSIONS.md: Full version information for all tested models
  • MODEL_NAME_UPDATES.md: Model name standardization notes
  • O3_MODEL_COMPARISON.md: OpenAI O3 model variant comparison

Analysis Notebooks

  • five_step_analysis.ipynb: Comprehensive 5-step performance analysis
  • error_analysis.ipynb: Error categorization and pattern analysis
  • error_analysis_plotting.ipynb: Additional error visualizations
  • plot_stats.ipynb: Legacy statistics plots

Output Structure

Each test run creates:

output_name/
├── model_timestamp/
│   ├── generated_code/     # LLM-generated Python scripts
│   ├── logs/               # Execution logs and supervisor records
│   ├── arrays/             # NumPy arrays produced by generated code
│   ├── plots/              # Comparison plots (generated vs. solution)
│   ├── prompt_pairs/       # User + supervisor prompts
│   ├── results/            # Temporary ROOT files (job-scoped)
│   └── snakemake_log/      # Snakemake execution logs

Job-scoped ROOT outputs:

  • Step 5 uses temporary ROOT files (signal.root, bkgd.root)
  • Written to ${OUTPUT_DIR}/results/ to prevent cross-run interference
  • Automatically cleaned after significance calculation
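
Because every run follows this layout, per-run artifacts can be collected programmatically. A minimal sketch (the top-level directory name is hypothetical):

from pathlib import Path

# Gather validation logs across all runs under one output directory.
output_root = Path("experiment1")  # hypothetical output_name
for log in sorted(output_root.glob("*/logs/*_validation.log")):
    run_dir = log.parts[1]         # the model_timestamp directory
    print(run_dir, log.name)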

Advanced Usage

Supervisor-Coder Configuration

Control iteration limits in config.yml:

model: 'anthropic/claude-sonnet:latest'
name: 'experiment_name'
out_dir: 'results/experiment_name'
max_iterations: 10  # Maximum supervisor-coder iterations per step
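
For reference, these settings can be read with PyYAML; this is an illustration only, not the framework's actual loading code:

import yaml

# Load config.yml and pull out the fields shown above.
with open("config.yml") as f:
    cfg = yaml.safe_load(f)

print(cfg["model"], cfg["out_dir"], cfg["max_iterations"])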

Parallel Execution Tuning

For test_models_parallel_gnu.sh:

# Syntax:
bash test_models_parallel_gnu.sh <output> <max_models> <tasks_per_model>

# Conservative (safe for shared systems):
bash test_models_parallel_gnu.sh test 3 5    # 15 total jobs

# Aggressive (dedicated nodes):
bash test_models_parallel_gnu.sh test 10 10  # 100 total jobs

Custom Validation

Run validation on specific steps or with custom tolerances:

# Validate only data conversion step
python check_soln.py --out_dir results/ --step 2

# Check multiple specific steps
python check_soln.py --out_dir results/ --step 2 --step 3 --step 4

Log Analysis Pipeline

# 1. Run tests
bash test_models_parallel_gnu.sh experiment1 5 5

# 2. Analyze logs with LLM
python logs_interpreter.py --log_dir experiment1/model_timestamp/ --output analysis.txt

# 3. Categorize errors
python error_analysis.py --results_dirs experiment1/*/ --output summary.csv

# 4. Generate visualizations
jupyter notebook error_analysis.ipynb

Roadmap and Future Directions

Planned Improvements

Prompt Engineering:

  • Auto-load context (file lists, logs) at step start
  • Provide comprehensive inputs/outputs/summaries upfront
  • Develop prompt-management layer for cross-analysis reuse

Validation & Monitoring:

  • Embed validation in workflows for immediate error detection
  • Record input/output and state transitions for reproducibility
  • Enhanced situation awareness through comprehensive logging

Multi-Analysis Extension:

  • Rerun H→γγ with improved system prompts
  • Extend to H→4ℓ and other Higgs+X channels
  • Provide learned materials from previous analyses as reference

Self-Improvement:

  • Reinforcement learning–style feedback loops
  • Agent-driven prompt refinement
  • Automatic generalization across HEP analyses

Citation and Acknowledgments

This framework tests LLM agents on samples from the 2020 ATLAS Open Data release.

Models are accessed through the CBORG API (Lawrence Berkeley National Laboratory).


Support and Contributing

For questions or issues:

  1. Check existing documentation in *.md files
  2. Review example configurations in config.yml
  3. Examine validation logs in output directories

For contributions, please ensure:

  • Model list files end with a blank line
  • Prompts follow the established format
  • Validation passes for all test cases