Large Language Model Analysis Framework for High Energy Physics
A framework for testing and evaluating Large Language Models (LLMs) on ATLAS H→γγ analysis tasks using a supervisor-coder architecture.
Table of Contents
Setup
Prerequisites
CBORG API Access Required
This framework uses Lawrence Berkeley National Laboratory's CBORG API to access a range of LLMs. To use this code, you will need:
- Access to the CBORG API (contact LBL for access)
- A CBORG API key
- Network access to the CBORG API endpoint
Note for External Users: CBORG is an internal LBL system. External users may need to:
- Request guest access through LBL collaborations
- Adapt the code to use the OpenAI API directly (requires code modifications; see the sketch below)
- Contact the repository maintainers for alternative deployment options
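For reference, a minimal sketch of a direct call with the standard openai Python client is shown below. The base URL, the model alias, and the assumption that the framework's LLM calls can be swapped out this way are illustrative only and are not guaranteed by the repository:
# minimal_client_sketch.py -- illustrative only, not part of the repository
import os
from openai import OpenAI

client = OpenAI(
    api_key=os.environ["CBORG_API_KEY"],       # or OPENAI_API_KEY for direct OpenAI use
    base_url=os.environ.get("LLM_BASE_URL"),   # placeholder: set to the CBORG endpoint,
)                                              # or leave unset to use api.openai.com

response = client.chat.completions.create(
    model="anthropic/claude-sonnet:latest",    # CBORG alias; use an OpenAI model name if calling OpenAI directly
    messages=[{"role": "user", "content": "Summarize the branches of a ROOT file."}],
)
print(response.choices[0].message.content)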
Environment Setup
Create Conda environment:
mamba env create -f environment.yml
conda activate llm_env
API Configuration
Create a script ~/.apikeys.sh that exports your CBORG API key:
export CBORG_API_KEY="INSERT_API_KEY"
Then source it before running tests:
source ~/.apikeys.sh
Initial Configuration
Before running tests, set up your configuration files:
# Copy example configuration files
cp config.example.yml config.yml
cp models.example.txt models.txt
# Edit config.yml to set your preferred models and parameters
# Edit models.txt to list models you want to test
Important: The models.txt file must end with a blank line.
Data and Solution
ATLAS Open Data Samples
All four data samples and Monte Carlo Higgs→γγ samples (including ttH) from the 2020 ATLAS Open Data diphoton campaign are available at:
/global/cfs/projectdirs/atlas/eligd/llm_for_analysis_copy/data/
Important: If copying data elsewhere, make the directory read-only to prevent LLM-generated code from modifying files:
chmod -R a-w /path/to/data/directory
Reference Solution
- Navigate to the solution/ directory and run python soln.py
- Use the flags --step1, --step2, --step3, --plot to control which parts execute
Reference Arrays for Validation
Large .npy reference arrays are not committed to Git (see .gitignore).
Quick fetch from repo root:
bash scripts/fetch_solution_arrays.sh
Or copy from NERSC shared path:
/global/cfs/projectdirs/atlas/dwkim/llm_test_dev_cxyang/llm_for_analysis/solution/arrays
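After fetching, a quick way to confirm the arrays are present and readable is to list their shapes. This assumes the arrays land in solution/arrays (matching the shared path above); adjust the path if your copy lives elsewhere:
# Quick sanity check of the fetched reference arrays (illustrative helper)
from pathlib import Path
import numpy as np

array_dir = Path("solution/arrays")
for npy in sorted(array_dir.glob("*.npy")):
    arr = np.load(npy)
    print(f"{npy.name}: shape={arr.shape}, dtype={arr.dtype}")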
Running Tests
Model Configuration
Three model list files control testing:
- models.txt: Models for sequential testing
- models_supervisor.txt: Supervisor models for paired testing
- models_coder.txt: Coder models for paired testing
Important formatting rules:
- One model per line
- File must end with a blank line
- Repeat model names for multiple trials
- Use CBORG aliases (e.g., anthropic/claude-sonnet:latest)
See CBORG_MODEL_MAPPINGS.md for available models and their actual versions.
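The sketch below shows one way these formatting rules can be applied when reading a model list in Python. The test scripts themselves parse the files in shell, so treat this only as a description of the expected format:
# read_model_list.py -- illustrative parser for the model list format
def read_model_list(path):
    with open(path) as f:
        text = f.read()
    if not text.endswith("\n"):
        raise ValueError(f"{path} must end with a blank line")
    # one model per line; repeated names are kept because they mean repeated trials
    return [line.strip() for line in text.splitlines() if line.strip()]

models = read_model_list("models.txt")
print(f"{len(models)} trials queued: {models}")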
Testing Workflows
1. Sequential Testing (Single Model at a Time)
bash test_models.sh output_dir_name
Tests all models in models.txt sequentially.
2. Parallel Testing (Multiple Models)
# Basic parallel execution
bash test_models_parallel.sh output_dir_name
# GNU Parallel (recommended for large-scale testing)
bash test_models_parallel_gnu.sh output_dir_name [max_models] [tasks_per_model]
# Examples:
bash test_models_parallel_gnu.sh experiment1 # Default: 5 models, 5 tasks each
bash test_models_parallel_gnu.sh test 3 5 # 3 models, 5 tasks per model
bash test_models_parallel_gnu.sh large_test 10 5 # 10 models, 5 tasks each
GNU Parallel features:
- Scales to 20-30 models with 200-300 total parallel jobs
- Automatic resource management
- Fast I/O using a /dev/shm temporary workspace
- Comprehensive error handling and logging
3. Step-by-Step Testing with Validation
# Run all 5 steps with validation
./run_smk_sequential.sh --validate
# Run specific steps
./run_smk_sequential.sh --step2 --step3 --validate --job-id 002
# Run individual steps
./run_smk_sequential.sh --step1 --validate # Step 1: Summarize ROOT
./run_smk_sequential.sh --step2 --validate # Step 2: Create NumPy arrays
./run_smk_sequential.sh --step3 --validate # Step 3: Preprocess
./run_smk_sequential.sh --step4 --validate # Step 4: Compute scores
./run_smk_sequential.sh --step5 --validate # Step 5: Categorization
# Custom output directory
./run_smk_sequential.sh --step1 --validate --auto-dir # Creates timestamped dir
Directory naming options:
- --job-id ID: Creates results_job_ID/
- --auto-dir: Creates results_YYYYMMDD_HHMMSS/ (see the sketch below)
- --out-dir DIR: Custom directory name
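For reference, the --auto-dir naming convention corresponds to a timestamp format like the one below (illustrative Python; the runner script does this in shell):
# results_YYYYMMDD_HHMMSS naming, as produced by --auto-dir
from datetime import datetime
from pathlib import Path

out_dir = Path(f"results_{datetime.now().strftime('%Y%m%d_%H%M%S')}")
out_dir.mkdir(parents=True, exist_ok=True)
print(out_dir)   # e.g. results_20250101_120000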
Validation
Automatic validation (during execution):
./run_smk_sequential.sh --step1 --step2 --validate
Validation logs saved to {output_dir}/logs/*_validation.log
Manual validation (after execution):
# Validate all steps
python check_soln.py --out_dir results_job_002
# Validate specific step
python check_soln.py --out_dir results_job_002 --step 2
Validation features:
- Adaptive tolerance with 4-significant-digit precision (see the sketch below)
- Column-by-column difference analysis
- Side-by-side value comparison
- Clear, actionable error messages
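As a rough illustration of what agreement to 4 significant digits means, the sketch below compares a generated array against its reference with a relative tolerance of about 5e-4. The file names are hypothetical, and check_soln.py's actual tolerance and reporting logic may differ:
# Illustrative 4-significant-digit comparison (assumes 2-D arrays)
import numpy as np

generated = np.load("results_job_002/arrays/step2_output.npy")   # hypothetical paths
reference = np.load("solution/arrays/step2_output.npy")

mismatch = ~np.isclose(generated, reference, rtol=5e-4, atol=0.0, equal_nan=True)
if mismatch.any():
    rows, cols = np.nonzero(mismatch)
    for r, c in list(zip(rows, cols))[:10]:   # side-by-side view of the first mismatches
        print(f"row {r}, col {c}: generated={generated[r, c]:.6g}  reference={reference[r, c]:.6g}")
else:
    print("arrays agree to ~4 significant digits")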
Speed Optimization
Reduce iteration counts in config.yml:
# Limit LLM coder attempts (default 10)
max_iterations: 3
Analysis and Visualization
Results Summary
All test results are aggregated in:
results_summary.csv
Columns include: supervisor, coder, step, success, iterations, duration, API_calls, tokens, errors, error_descriptions
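A quick way to inspect the summary is with pandas; the column names are those listed above, and success is assumed to be stored as 0/1 or boolean:
# Aggregate success rates per coder model and step from results_summary.csv
import pandas as pd

df = pd.read_csv("results_summary.csv")
success_by_model = (
    df.groupby(["coder", "step"])["success"]
      .mean()
      .unstack("step")   # rows: coder model, columns: workflow step
)
print(success_by_model.round(2))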
Error Analysis and Categorization
Automated error analysis:
python error_analysis.py --results_dirs <dir1> <dir2> ... --output results_summary.csv --model <model_name>
Uses an LLM to analyze the comprehensive logs and categorize errors into:
- Semantic errors
- Function-calling errors
- Intermediate file not found
- Incorrect branch name
- OpenAI API errors
- Data quality issues (all weights = 0)
- Other/uncategorized
Interactive Analysis Notebooks
1. Five-Step Performance Analysis (five_step_analysis.ipynb)
Comprehensive analysis of model performance across all 5 workflow steps:
- Success rate heatmap (models Γ steps)
- Agent work progression (iterations over steps)
- API call statistics (by step and model)
- Cost analysis (input/output tokens, estimated pricing)
Output plots:
- plots/1_success_rate_heatmap.pdf
- plots/2_agent_work_line_plot.pdf
- plots/3_api_calls_line_plot.pdf
- plots/4_cost_per_step.pdf
- plots/five_step_summary_stats.csv
2. Error Category Analysis (error_analysis.ipynb)
Deep dive into error patterns and failure modes:
- Normalized error distribution (stacked bar chart with percentages)
- Error type heatmap (models Γ error categories)
- Top model breakdowns (faceted plots for top 9 models)
- Error trends across steps (stacked area chart)
Output plots:
- plots/error_distribution_by_model.pdf
- plots/error_heatmap_by_model.pdf
- plots/error_categories_top_models.pdf
- plots/errors_by_step.pdf
3. Quick Statistics (plot_stats.ipynb)
Legacy notebook for basic statistics visualization.
Log Interpretation
Automated log analysis:
python logs_interpreter.py --log_dir <output_dir> --model lbl/cborg-deepthought --output analysis.txt
Analyzes comprehensive supervisor-coder logs to identify:
- Root causes of failures
- Responsible parties (user, supervisor, coder, external)
- Error patterns across iterations
Project Structure
Core Scripts
- supervisor_coder.py: Supervisor-coder framework implementation
- check_soln.py: Solution validation with enhanced comparison
- write_prompt.py: Prompt management and context chaining
- update_stats.py: Statistics tracking and CSV updates
- error_analysis.py: LLM-powered error categorization
Test Runners
- test_models.sh: Sequential model testing
- test_models_parallel.sh: Parallel testing (basic)
- test_models_parallel_gnu.sh: GNU Parallel testing (recommended)
- test_stats.sh: Individual model statistics
- test_stats_parallel.sh: Parallel step execution
- run_smk_sequential.sh: Step-by-step workflow runner
Snakemake Workflows (workflow/)
The analysis workflow is divided into 5 sequential steps:
- summarize_root.smk: Extract ROOT file structure and branch information
- create_numpy.smk: Convert ROOT → NumPy arrays
- preprocess.smk: Apply preprocessing transformations
- scores.smk: Compute signal/background classification scores
- categorization.smk: Final categorization and statistical analysis
Note: Later steps use the reference solution's outputs as inputs, so each step can be tested even when earlier steps fail.
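Conceptually, the fallback amounts to preferring the LLM-generated intermediate file when it exists and using the reference solution's copy otherwise. The Python sketch below only illustrates that idea; the real mechanism lives in the .smk rules, and the paths are hypothetical:
# Illustrative input selection with fallback to the reference solution
from pathlib import Path

def step_input(filename, out_dir="results_job_002"):
    generated = Path(out_dir) / "arrays" / filename   # output of the previous LLM step
    fallback = Path("solution/arrays") / filename     # reference solution output
    return generated if generated.exists() else fallback

print(step_input("step2_output.npy"))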
Prompts (prompts/)
- summarize_root.txt: Step 1 task description
- create_numpy.txt: Step 2 task description
- preprocess.txt: Step 3 task description
- scores.txt: Step 4 task description
- categorization.txt: Step 5 task description
- supervisor_first_call.txt: Initial supervisor instructions
- supervisor_call.txt: Subsequent supervisor instructions
Utility Scripts (util/)
- inspect_root.py: ROOT file inspection tools
- analyze_particles.py: Particle-level analysis
- compare_arrays.py: NumPy array comparison utilities
Model Documentation
- CBORG_MODEL_MAPPINGS.md: CBORG alias → actual model mappings
- COMPLETE_MODEL_VERSIONS.md: Full version information for all tested models
- MODEL_NAME_UPDATES.md: Model name standardization notes
- O3_MODEL_COMPARISON.md: OpenAI O3 model variant comparison
Analysis Notebooks
- five_step_analysis.ipynb: Comprehensive 5-step performance analysis
- error_analysis.ipynb: Error categorization and pattern analysis
- error_analysis_plotting.ipynb: Additional error visualizations
- plot_stats.ipynb: Legacy statistics plots
Output Structure
Each test run creates:
output_name/
├── model_timestamp/
│   ├── generated_code/    # LLM-generated Python scripts
│   ├── logs/              # Execution logs and supervisor records
│   ├── arrays/            # NumPy arrays produced by generated code
│   ├── plots/             # Comparison plots (generated vs. solution)
│   ├── prompt_pairs/      # User + supervisor prompts
│   ├── results/           # Temporary ROOT files (job-scoped)
│   └── snakemake_log/     # Snakemake execution logs
Job-scoped ROOT outputs:
- Step 5 uses temporary ROOT files (signal.root, bkgd.root)
- Written to ${OUTPUT_DIR}/results/ to prevent cross-run interference
- Automatically cleaned up after the significance calculation
Advanced Usage
Supervisor-Coder Configuration
Control iteration limits in config.yml:
model: 'anthropic/claude-sonnet:latest'
name: 'experiment_name'
out_dir: 'results/experiment_name'
max_iterations: 10 # Maximum supervisor-coder iterations per step
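The max_iterations setting bounds the feedback loop between the two agents. Below is a greatly simplified sketch of that loop; supervisor_coder.py implements the real version (prompt construction, logging, Snakemake integration), and the coder/supervisor callables here are hypothetical stand-ins for the LLM calls:
# Simplified supervisor-coder loop (mental model only)
import subprocess
from pathlib import Path

def run_step(task_prompt, coder, supervisor, max_iterations=10):
    Path("generated_code").mkdir(exist_ok=True)
    feedback = ""
    for _ in range(max_iterations):
        code = coder(task_prompt + feedback)               # coder LLM writes a script
        Path("generated_code/step.py").write_text(code)
        result = subprocess.run(["python", "generated_code/step.py"],
                                capture_output=True, text=True)
        if result.returncode == 0:
            return True                                    # step succeeded
        feedback = "\n" + supervisor(code, result.stderr)  # supervisor reviews the failure
    return False                                           # gave up after max_iterations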
Parallel Execution Tuning
For test_models_parallel_gnu.sh:
# Syntax:
bash test_models_parallel_gnu.sh <output> <max_models> <tasks_per_model>
# Conservative (safe for shared systems):
bash test_models_parallel_gnu.sh test 3 5 # 15 total jobs
# Aggressive (dedicated nodes):
bash test_models_parallel_gnu.sh test 10 10 # 100 total jobs
Custom Validation
Run validation on specific steps or with custom tolerances:
# Validate only data conversion step
python check_soln.py --out_dir results/ --step 2
# Check multiple specific steps
python check_soln.py --out_dir results/ --step 2 --step 3 --step 4
Log Analysis Pipeline
# 1. Run tests
bash test_models_parallel_gnu.sh experiment1 5 5
# 2. Analyze logs with LLM
python logs_interpreter.py --log_dir experiment1/model_timestamp/ --output analysis.txt
# 3. Categorize errors
python error_analysis.py --results_dirs experiment1/*/ --output summary.csv
# 4. Generate visualizations
jupyter notebook error_analysis.ipynb
Roadmap and Future Directions
Planned Improvements
Prompt Engineering:
- Auto-load context (file lists, logs) at step start
- Provide comprehensive inputs/outputs/summaries upfront
- Develop prompt-management layer for cross-analysis reuse
Validation & Monitoring:
- Embed validation in workflows for immediate error detection
- Record input/output and state transitions for reproducibility
- Enhanced situation awareness through comprehensive logging
Multi-Analysis Extension:
- Rerun H→γγ with improved system prompts
- Extend to H→4ℓ and other Higgs+X channels
- Provide learned materials from previous analyses as reference
Self-Improvement:
- Reinforcement-learning-style feedback loops
- Agent-driven prompt refinement
- Automatic generalization across HEP analyses
Citation and Acknowledgments
This framework tests LLM agents on ATLAS Open Data from:
- 2020 ATLAS Open Data diphoton samples: https://opendata.cern.ch/record/15006
Models tested via CBORG API (Lawrence Berkeley National Laboratory).
Support and Contributing
For questions or issues:
- Check the existing documentation in the *.md files
- Review the example configuration in config.yml
- Examine validation logs in the output directories
For contributions, please ensure:
- Model lists end with blank lines
- Prompts follow the established format
- Validation passes for all test cases