ho22joshua committed on
Commit 242932b · 1 Parent(s): 751d271

new readme following paper, renaming original readme to SETUP.md

Files changed (2):
  1. README.md +87 -396
  2. SETUP.md +448 -0
README.md CHANGED
@@ -1,448 +1,139 @@
1
- # Large Language Model Analysis Framework for High Energy Physics
2
 
3
- A framework for testing and evaluating Large Language Models (LLMs) on ATLAS H→γγ analysis tasks using a supervisor-coder architecture.
4
 
5
- ## Table of Contents
6
- - [Setup](#setup)
7
- - [Data and Solution](#data-and-solution)
8
- - [Running Tests](#running-tests)
9
- - [Analysis and Visualization](#analysis-and-visualization)
10
- - [Project Structure](#project-structure)
11
- - [Advanced Usage](#advanced-usage)
12
 
13
- ---
14
-
15
- ## Setup
16
-
17
- ### Prerequisites
18
 
19
- **CBORG API Access Required**
20
 
21
- This framework uses Lawrence Berkeley National Laboratory's CBORG API to access various LLM models. To use this code, you will need:
22
 
23
- 1. Access to the CBORG API (contact LBL for access)
24
- 2. A CBORG API key
25
- 3. Network access to the CBORG API endpoint
26
 
27
- **Note for External Users:** CBORG is an internal LBL system. External users may need to:
28
- - Request guest access through LBL collaborations
29
- - Adapt the code to use OpenAI API directly (requires code modifications)
30
- - Contact the repository maintainers for alternative deployment options
31
 
32
- ### Environment Setup
33
- Create Conda environment:
34
- ```bash
35
- mamba env create -f environment.yml
36
- conda activate llm_env
37
- ```
38
 
39
- ### API Configuration
40
- Create script `~/.apikeys.sh` to export CBORG API key:
41
- ```bash
42
- export CBORG_API_KEY="INSERT_API_KEY"
43
- ```
44
 
45
- Then source it before running tests:
46
- ```bash
47
- source ~/.apikeys.sh
48
- ```
49
 
50
- ### Initial Configuration
51
 
52
- Before running tests, set up your configuration files:
53
 
54
- ```bash
55
- # Copy example configuration files
56
- cp config.example.yml config.yml
57
- cp models.example.txt models.txt
58
 
59
- # Edit config.yml to set your preferred models and parameters
60
- # Edit models.txt to list models you want to test
61
- ```
62
 
63
- **Important:** The `models.txt` file must end with a blank line.
64
-
65
- ---
66
 
67
- ## Data and Solution
 
68
 
69
- ### ATLAS Open Data Samples
70
- All four data samples and Monte Carlo Higgs→γγ samples (including ttH) from the 2020 ATLAS Open Data diphoton campaign are available at:
71
- ```
72
- /global/cfs/projectdirs/atlas/eligd/llm_for_analysis_copy/data/
73
- ```
74
 
75
- **Important:** If copying data elsewhere, make the directory read-only to prevent LLM-generated code from modifying files:
76
- ```bash
77
- chmod -R a-w /path/to/data/directory
78
- ```
79
 
80
- ### Reference Solution
81
- - Navigate to `solution/` directory and run `python soln.py`
82
- - Use flags: `--step1`, `--step2`, `--step3`, `--plot` to control execution
83
 
84
- ### Reference Arrays for Validation
85
- Large `.npy` reference arrays are not committed to Git (see `.gitignore`).
86
 
87
- **Quick fetch from repo root:**
88
- ```bash
89
- bash scripts/fetch_solution_arrays.sh
90
- ```
91
 
92
- **Or copy from NERSC shared path:**
93
- ```
94
- /global/cfs/projectdirs/atlas/dwkim/llm_test_dev_cxyang/llm_for_analysis/solution/arrays
95
- ```
96
 
97
- ---
98
 
99
- ## Running Tests
100
-
101
- ### Model Configuration
102
-
103
- Three model list files control testing:
104
- - **`models.txt`**: Models for sequential testing
105
- - **`models_supervisor.txt`**: Supervisor models for paired testing
106
- - **`models_coder.txt`**: Coder models for paired testing
107
-
108
- **Important formatting rules:**
109
- - One model per line
110
- - File must end with a blank line
111
- - Repeat model names for multiple trials
112
- - Use CBORG aliases (e.g., `anthropic/claude-sonnet:latest`)
113
-
114
- See `CBORG_MODEL_MAPPINGS.md` for available models and their actual versions.
115
-
116
- ### Testing Workflows
117
-
118
- #### 1. Sequential Testing (Single Model at a Time)
119
- ```bash
120
- bash test_models.sh output_dir_name
121
- ```
122
- Tests all models in `models.txt` sequentially.
123
-
124
- #### 2. Parallel Testing (Multiple Models)
125
- ```bash
126
- # Basic parallel execution
127
- bash test_models_parallel.sh output_dir_name
128
-
129
- # GNU Parallel (recommended for large-scale testing)
130
- bash test_models_parallel_gnu.sh output_dir_name [max_models] [tasks_per_model]
131
-
132
- # Examples:
133
- bash test_models_parallel_gnu.sh experiment1 # Default: 5 models, 5 tasks each
134
- bash test_models_parallel_gnu.sh test 3 5 # 3 models, 5 tasks per model
135
- bash test_models_parallel_gnu.sh large_test 10 5 # 10 models, 5 tasks each
136
- ```
137
-
138
- **GNU Parallel features:**
139
- - Scales to 20-30 models with 200-300 total parallel jobs
140
- - Automatic resource management
141
- - Fast I/O using `/dev/shm` temporary workspace
142
- - Comprehensive error handling and logging
143
-
144
- #### 3. Step-by-Step Testing with Validation
145
- ```bash
146
- # Run all 5 steps with validation
147
- ./run_smk_sequential.sh --validate
148
-
149
- # Run specific steps
150
- ./run_smk_sequential.sh --step2 --step3 --validate --job-id 002
151
-
152
- # Run individual steps
153
- ./run_smk_sequential.sh --step1 --validate # Step 1: Summarize ROOT
154
- ./run_smk_sequential.sh --step2 --validate # Step 2: Create NumPy arrays
155
- ./run_smk_sequential.sh --step3 --validate # Step 3: Preprocess
156
- ./run_smk_sequential.sh --step4 --validate # Step 4: Compute scores
157
- ./run_smk_sequential.sh --step5 --validate # Step 5: Categorization
158
-
159
- # Custom output directory
160
- ./run_smk_sequential.sh --step1 --validate --auto-dir # Creates timestamped dir
161
- ```
162
-
163
- **Directory naming options:**
164
- - `--job-id ID`: Creates `results_job_ID/`
165
- - `--auto-dir`: Creates `results_YYYYMMDD_HHMMSS/`
166
- - `--out-dir DIR`: Custom directory name
167
-
168
- ### Validation
169
-
170
- **Automatic validation (during execution):**
171
- ```bash
172
- ./run_smk_sequential.sh --step1 --step2 --validate
173
- ```
174
- Validation logs saved to `{output_dir}/logs/*_validation.log`
175
-
176
- **Manual validation (after execution):**
177
- ```bash
178
- # Validate all steps
179
- python check_soln.py --out_dir results_job_002
180
-
181
- # Validate specific step
182
- python check_soln.py --out_dir results_job_002 --step 2
183
- ```
184
-
185
- **Validation features:**
186
- - βœ… Adaptive tolerance with 4 significant digit precision
187
- - πŸ“Š Column-by-column difference analysis
188
- - πŸ“‹ Side-by-side value comparison
189
- - 🎯 Clear, actionable error messages
190
-
191
- ### Speed Optimization
192
-
193
- Reduce iteration counts in `config.yml`:
194
- ```yaml
195
- # Limit LLM coder attempts (default 10)
196
- max_iterations: 3
197
- ```
198
 
199
- ---
200
 
201
- ## Analysis and Visualization
202
-
203
- ### Results Summary
204
- All test results are aggregated in:
205
- ```
206
- results_summary.csv
207
- ```
208
-
209
- **Columns include:** supervisor, coder, step, success, iterations, duration, API_calls, tokens, errors, error_descriptions
210
-
211
- ### Error Analysis and Categorization
212
-
213
- **Automated error analysis:**
214
- ```bash
215
- python error_analysis.py --results_dirs <dir1> <dir2> ... --output results_summary.csv --model <model_name>
216
- ```
217
-
218
- Uses LLM to analyze comprehensive logs and categorize errors into:
219
- - Semantic errors
220
- - Function-calling errors
221
- - Intermediate file not found
222
- - Incorrect branch name
223
- - OpenAI API errors
224
- - Data quality issues (all weights = 0)
225
- - Other/uncategorized
226
-
227
- ### Interactive Analysis Notebooks
228
-
229
- #### 1. Five-Step Performance Analysis (`five_step_analysis.ipynb`)
230
- Comprehensive analysis of model performance across all 5 workflow steps:
231
- - **Success rate heatmap** (models Γ— steps)
232
- - **Agent work progression** (iterations over steps)
233
- - **API call statistics** (by step and model)
234
- - **Cost analysis** (input/output tokens, estimated pricing)
235
-
236
- **Output plots:**
237
- - `plots/1_success_rate_heatmap.pdf`
238
- - `plots/2_agent_work_line_plot.pdf`
239
- - `plots/3_api_calls_line_plot.pdf`
240
- - `plots/4_cost_per_step.pdf`
241
- - `plots/five_step_summary_stats.csv`
242
-
243
- #### 2. Error Category Analysis (`error_analysis.ipynb`)
244
- Deep dive into error patterns and failure modes:
245
- - **Normalized error distribution** (stacked bar chart with percentages)
246
- - **Error type heatmap** (models Γ— error categories)
247
- - **Top model breakdowns** (faceted plots for top 9 models)
248
- - **Error trends across steps** (stacked area chart)
249
-
250
- **Output plots:**
251
- - `plots/error_distribution_by_model.pdf`
252
- - `plots/error_heatmap_by_model.pdf`
253
- - `plots/error_categories_top_models.pdf`
254
- - `plots/errors_by_step.pdf`
255
-
256
- #### 3. Quick Statistics (`plot_stats.ipynb`)
257
- Legacy notebook for basic statistics visualization.
258
-
259
- ### Log Interpretation
260
-
261
- **Automated log analysis:**
262
- ```bash
263
- python logs_interpreter.py --log_dir <output_dir> --model lbl/cborg-deepthought --output analysis.txt
264
- ```
265
-
266
- Analyzes comprehensive supervisor-coder logs to identify:
267
- - Root causes of failures
268
- - Responsible parties (user, supervisor, coder, external)
269
- - Error patterns across iterations
270
 
271
- ---
272
 
273
- ## Project Structure
274
-
275
- ### Core Scripts
276
- - **`supervisor_coder.py`**: Supervisor-coder framework implementation
277
- - **`check_soln.py`**: Solution validation with enhanced comparison
278
- - **`write_prompt.py`**: Prompt management and context chaining
279
- - **`update_stats.py`**: Statistics tracking and CSV updates
280
- - **`error_analysis.py`**: LLM-powered error categorization
281
-
282
- ### Test Runners
283
- - **`test_models.sh`**: Sequential model testing
284
- - **`test_models_parallel.sh`**: Parallel testing (basic)
285
- - **`test_models_parallel_gnu.sh`**: GNU Parallel testing (recommended)
286
- - **`test_stats.sh`**: Individual model statistics
287
- - **`test_stats_parallel.sh`**: Parallel step execution
288
- - **`run_smk_sequential.sh`**: Step-by-step workflow runner
289
-
290
- ### Snakemake Workflows (`workflow/`)
291
- The analysis workflow is divided into 5 sequential steps:
292
-
293
- 1. **`summarize_root.smk`**: Extract ROOT file structure and branch information
294
- 2. **`create_numpy.smk`**: Convert ROOT β†’ NumPy arrays
295
- 3. **`preprocess.smk`**: Apply preprocessing transformations
296
- 4. **`scores.smk`**: Compute signal/background classification scores
297
- 5. **`categorization.smk`**: Final categorization and statistical analysis
298
-
299
- **Note:** Later steps use solution outputs to enable testing even when earlier steps fail.
300
-
301
- ### Prompts (`prompts/`)
302
- - `summarize_root.txt`: Step 1 task description
303
- - `create_numpy.txt`: Step 2 task description
304
- - `preprocess.txt`: Step 3 task description
305
- - `scores.txt`: Step 4 task description
306
- - `categorization.txt`: Step 5 task description
307
- - `supervisor_first_call.txt`: Initial supervisor instructions
308
- - `supervisor_call.txt`: Subsequent supervisor instructions
309
-
310
- ### Utility Scripts (`util/`)
311
- - **`inspect_root.py`**: ROOT file inspection tools
312
- - **`analyze_particles.py`**: Particle-level analysis
313
- - **`compare_arrays.py`**: NumPy array comparison utilities
314
-
315
- ### Model Documentation
316
- - **`CBORG_MODEL_MAPPINGS.md`**: CBORG alias β†’ actual model mappings
317
- - **`COMPLETE_MODEL_VERSIONS.md`**: Full version information for all tested models
318
- - **`MODEL_NAME_UPDATES.md`**: Model name standardization notes
319
- - **`O3_MODEL_COMPARISON.md`**: OpenAI O3 model variant comparison
320
-
321
- ### Analysis Notebooks
322
- - **`five_step_analysis.ipynb`**: Comprehensive 5-step performance analysis
323
- - **`error_analysis.ipynb`**: Error categorization and pattern analysis
324
- - **`error_analysis_plotting.ipynb`**: Additional error visualizations
325
- - **`plot_stats.ipynb`**: Legacy statistics plots
326
-
327
- ### Output Structure
328
- Each test run creates:
329
- ```
330
- output_name/
331
- β”œβ”€β”€ model_timestamp/
332
- β”‚ β”œβ”€β”€ generated_code/ # LLM-generated Python scripts
333
- β”‚ β”œβ”€β”€ logs/ # Execution logs and supervisor records
334
- β”‚ β”œβ”€β”€ arrays/ # NumPy arrays produced by generated code
335
- β”‚ β”œβ”€β”€ plots/ # Comparison plots (generated vs. solution)
336
- β”‚ β”œβ”€β”€ prompt_pairs/ # User + supervisor prompts
337
- β”‚ β”œβ”€β”€ results/ # Temporary ROOT files (job-scoped)
338
- β”‚ └── snakemake_log/ # Snakemake execution logs
339
- ```
340
-
341
- **Job-scoped ROOT outputs:**
342
- - Step 5 uses temporary ROOT files (`signal.root`, `bkgd.root`)
343
- - Written to `${OUTPUT_DIR}/results/` to prevent cross-run interference
344
- - Automatically cleaned after significance calculation
345
 
346
- ---
347
 
348
- ## Advanced Usage
 
 
349
 
350
- ### Supervisor-Coder Configuration
351
 
352
- Control iteration limits in `config.yml`:
353
- ```yaml
354
- model: 'anthropic/claude-sonnet:latest'
355
- name: 'experiment_name'
356
- out_dir: 'results/experiment_name'
357
- max_iterations: 10 # Maximum supervisor-coder iterations per step
358
- ```
359
 
360
- ### Parallel Execution Tuning
361
 
362
- For `test_models_parallel_gnu.sh`:
363
- ```bash
364
- # Syntax:
365
- bash test_models_parallel_gnu.sh <output> <max_models> <tasks_per_model>
 
366
 
367
- # Conservative (safe for shared systems):
368
- bash test_models_parallel_gnu.sh test 3 5 # 15 total jobs
369
 
370
- # Aggressive (dedicated nodes):
371
- bash test_models_parallel_gnu.sh test 10 10 # 100 total jobs
372
- ```
 
 
 
 
 
373
 
374
- ### Custom Validation
375
 
376
- Run validation on specific steps or with custom tolerances:
377
- ```bash
378
- # Validate only data conversion step
379
- python check_soln.py --out_dir results/ --step 2
380
 
381
- # Check multiple specific steps
382
- python check_soln.py --out_dir results/ --step 2 --step 3 --step 4
383
- ```
384
 
385
- ### Log Analysis Pipeline
386
 
387
- ```bash
388
- # 1. Run tests
389
- bash test_models_parallel_gnu.sh experiment1 5 5
390
 
391
- # 2. Analyze logs with LLM
392
- python logs_interpreter.py --log_dir experiment1/model_timestamp/ --output analysis.txt
393
 
394
- # 3. Categorize errors
395
- python error_analysis.py --results_dirs experiment1/*/ --output summary.csv
396
 
397
- # 4. Generate visualizations
398
- jupyter notebook error_analysis.ipynb
399
- ```
400
 
401
- ---
402
 
403
- ## Roadmap and Future Directions
404
 
405
- ### Planned Improvements
406
 
407
- **Prompt Engineering:**
408
- - Auto-load context (file lists, logs) at step start
409
- - Provide comprehensive inputs/outputs/summaries upfront
410
- - Develop prompt-management layer for cross-analysis reuse
411
 
412
- **Validation & Monitoring:**
413
- - Embed validation in workflows for immediate error detection
414
- - Record input/output and state transitions for reproducibility
415
- - Enhanced situation awareness through comprehensive logging
416
-
417
- **Multi-Analysis Extension:**
418
- - Rerun H→γγ with improved system prompts
419
- - Extend to H→4ℓ and other Higgs+X channels
420
- - Provide learned materials from previous analyses as reference
421
-
422
- **Self-Improvement:**
423
- - Reinforcement learning–style feedback loops
424
- - Agent-driven prompt refinement
425
- - Automatic generalization across HEP analyses
426
 
427
  ---
428
 
429
- ## Citation and Acknowledgments
430
-
431
- This framework tests LLM agents on ATLAS Open Data from:
432
- - 2020 ATLAS Open Data diphoton samples: https://opendata.cern.ch/record/15006
433
-
434
- Models tested via CBORG API (Lawrence Berkeley National Laboratory).
435
-
436
- ---
437
-
438
- ## Support and Contributing
439
-
440
- For questions or issues:
441
- 1. Check existing documentation in `*.md` files
442
- 2. Review example configurations in `config.yml`
443
- 3. Examine validation logs in output directories
444
-
445
- For contributions, please ensure:
446
- - Model lists end with blank lines
447
- - Prompts follow established format
448
- - Validation passes for all test cases
 
1
+ ## Abstract
2
 
3
+ We present a proof-of-principle study demonstrating the use of large language model (LLM) agents to automate a representative high energy physics (HEP) analysis. Using the Higgs boson diphoton cross-section measurement as a case study with ATLAS Open Data, we design a hybrid system that combines an LLM-based supervisor–coder agent with the `Snakemake` workflow manager.
4
 
5
+ In this architecture, the workflow manager enforces reproducibility and determinism, while the agent autonomously generates, executes, and iteratively corrects analysis code in response to user instructions. We define quantitative evaluation metrics (success rate and error distribution) to assess agent performance across multi-stage workflows.
6
 
7
+ To characterize variability across architectures, we benchmark a representative selection of state-of-the-art LLMs, spanning the *Gemini* and *GPT-5* series, the *Claude* family, and leading open-weight models. While the workflow manager ensures deterministic execution of all analysis steps, the final outputs still show stochastic variation. Although we set the temperature to zero, other sampling parameters (e.g., top-p, top-k) remained at their defaults, and some reasoning-oriented models internally adjust these settings. Consequently, the models do not produce fully deterministic results.
 
 
 
 
8
 
9
+ This study establishes the first LLM-agent-driven automated data-analysis framework in HEP, enabling systematic benchmarking of model capabilities, stability, and limitations in real-world scientific computing environments.
10
 
11
+ The baseline code used in this work is [available here](https://huggingface.co/HWresearch/LLM4HEP/tree/main).
12
 
13
+ ## Introduction
 
 
14
 
15
+ While large language models (LLMs) and agentic systems have been explored for automating components of scientific discovery in fields such as biology (e.g., CellAgent for single-cell RNA-seq analysis [Xiao et al., 2024](#ref-xiao-et-al-2024)) and software engineering (e.g., LangGraph-based bug fixing agents [Wang & Duan, 2025](#ref-wang-duan-2025)), their application to high energy physics (HEP) remains largely unexplored. In HEP, analysis pipelines are highly structured, resource-intensive, and have strict requirements of reproducibility and statistical rigor. Existing work in HEP has primarily focused on machine learning models for event classification or simulation acceleration, and more recently on domain-adapted LLMs (e.g., FeynTune [Richmond et al., 2025](#ref-richmond-et-al-2025), Astro-HEP-BERT [Simons, 2024](#ref-simons-2024), BBT-Neutron [Wu et al., 2024](#ref-wu-et-al-2024)) and conceptual roadmaps for large physics models. However, to the best of our knowledge, no published study has demonstrated an agentic system that autonomously generates, executes, validates, and iterates on HEP data analysis workflows.
 
 
 
16
 
17
+ In this work, we present the first attempt to operationalize LLM-based agents within a reproducible HEP analysis pipeline. Our approach integrates a task-focused agent with the `Snakemake` workflow manager [Mölder et al., 2021](#ref-molder-et-al-2021), leveraging `Snakemake`'s HPC-native execution and file-based provenance to enforce determinism while delegating bounded subtasks (e.g., event selection, code generation, validation, and self-correction) to an LLM agent. Unlike prior agent frameworks such as LangChain [Chase, 2022](#ref-chase-2022) or LangGraph [LangChain Team, 2025](#ref-langchain-team-2025), which emphasize flexible multi-step reasoning, this design embeds the agent within a domain-constrained directed acyclic graph (DAG), ensuring both scientific reliability and AI relevance.
18
 
19
+ ## Description of a Representative High Energy Physics Analysis
 
 
 
 
20
 
21
+ In this paper, we use a cross-section measurement of the Higgs boson decaying to the diphoton channel at the Large Hadron Collider as an example. We employ collision data and simulation samples from the 2020 ATLAS Open Data release [ATLAS Collaboration, 2020](#ref-atlas-collab-2020). The data sample corresponds to **10 fb⁻¹** of proton-proton collisions collected in 2016 at $\sqrt{s} = 13$ TeV. Higgs boson production is simulated for gluon fusion, vector boson fusion, associated production with a vector boson, and associated production with a top-quark pair.
 
 
 
22
 
23
+ Our example analysis uses control samples derived from collision data and Higgs production simulation samples to design a categorized analysis with a machine-learning-based event classifier, similar to the one used by the CMS collaboration in their Higgs observation paper [CMS Collaboration, 2012](#ref-cms-collab-2012). The objective of this analysis is to maximize the expected significance of the Higgs-to-diphoton signal. Since the purpose of this work is to demonstrate the LLM-based agentic implementation of a HEP analysis, we do not report the observed significance. The workflow of this analysis is highly representative of HEP analyses.
24
 
25
+ The technical implementation of the analysis workflow is factorized into five sequential steps, executed through the `Snakemake` workflow management system. Each step is designed to test a distinct type of reasoning or code-generation capability, and the evaluation of each step is performed independently, i.e., the success of a given step does not depend on the completion or correctness of the previous one. This design allows for consistent benchmarking across heterogeneous tasks while maintaining deterministic workflow execution.
26
 
27
+ **Step 1 (ROOT file inspection):**
28
+ Generates summary text files describing the ROOT files and their internal structure, including trees and branches, to provide the agent with a human-readable overview of the available data.
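
+ As a concrete illustration, such a summary could be produced as follows (a minimal sketch assuming the `uproot` library; the generated scripts and the file paths, which are placeholders here, vary from trial to trial):

+ ```python
+ import uproot
+ 
+ # Sketch: dump the tree/branch structure of one input file to a text summary.
+ # "data_A.root" and "root_summary.txt" are illustrative placeholder paths.
+ with uproot.open("data_A.root") as f, open("root_summary.txt", "w") as out:
+     for name, classname in f.classnames().items():
+         out.write(f"{name}: {classname}\n")
+         if classname == "TTree":
+             for branch in f[name].keys():
+                 out.write(f"    {branch}\n")
+ ```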
 
 
29
 
30
+ **Step 2 (Ntuple conversion):**
31
+ Produces a Python script that reads all particle observables specified in the user prompt from the `TTree` objects in ROOT files [Antcheva et al., 2009](#ref-antcheva-et-al-2009) and converts them into `numpy` arrays for downstream analysis.
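
+ A minimal sketch of this conversion, again assuming `uproot` and using purely illustrative tree and branch names (the real observables are those specified in the user prompt):

+ ```python
+ import numpy as np
+ import uproot
+ 
+ # Illustrative only: the tree name "mini" and the branch names are placeholders,
+ # and the branches are assumed to be flat (one value per event).
+ branches = ["photon_pt_lead", "photon_pt_sublead", "photon_eta_lead", "m_yy"]
+ with uproot.open("mc_ggH_yy.root") as f:
+     arrays = f["mini"].arrays(branches, library="np")
+ np.save("signal.npy", np.column_stack([arrays[b] for b in branches]))
+ ```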
 
32
 
33
+ **Step 3 (Preprocessing):**
34
+ Normalizes the signal and background arrays and applies standard selection criteria to prepare the datasets for machine-learning–based classification.
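
+ For illustration, a preprocessing sketch of this kind (the column index, cut value, and normalization scheme are placeholders, not the analysis selection):

+ ```python
+ import numpy as np
+ 
+ def preprocess(arr):
+     # Placeholder selection: keep events with leading-photon pT above 25 GeV
+     # (column 0 is assumed to hold that observable in this sketch).
+     sel = arr[arr[:, 0] > 25.0]
+     # Standardize each column to zero mean and unit variance.
+     mean, std = sel.mean(axis=0), sel.std(axis=0)
+     return (sel - mean) / np.where(std > 0, std, 1.0)
+ ```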
 
35
 
36
+ **Step 4 (S-B separation):**
37
+ Applies TabPFN [Hollmann et al., 2025](#ref-hollmann-et-al-2025), a transformer-based foundation model for tabular data, to perform signal–background separation. The workflow requires a script that calls a provided function to train and evaluate the TabPFN model using the appropriate datasets and hyperparameters.
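
+ The core of this step reduces to a call of roughly the following form (a sketch with default settings; the workflow supplies its own wrapper function, datasets, and hyperparameters):

+ ```python
+ from tabpfn import TabPFNClassifier
+ 
+ def train_and_score(X_train, y_train, X_eval):
+     # y_train labels signal (1) vs. background (0); X_* are the Step 3 arrays.
+     clf = TabPFNClassifier()   # default settings; the workflow passes its own
+     clf.fit(X_train, y_train)
+     return clf.predict_proba(X_eval)[:, 1]   # per-event signal score in [0, 1]
+ ```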
38
 
39
+ **Step 5 (Categorization):**
40
+ Performs statistical categorization of events by defining optimized boundaries on the TabPFN score to maximize the expected significance of the Higgs-to-diphoton signal. This step is implemented as an iterative function-calling procedure, where new category boundaries are added until the improvement in expected significance falls below 5%.
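
+ A sketch of the underlying significance computation is shown below; the per-category formula is a common Asimov-style approximation and the helper names are illustrative, so the framework's provided significance function may differ in detail.

+ ```python
+ import numpy as np
+ 
+ def category_significance(s, b):
+     # Asimov-style approximation for a single category (a sketch, not
+     # necessarily the framework's provided significance function).
+     return np.sqrt(2.0 * ((s + b) * np.log(1.0 + s / b) - s)) if b > 0 else 0.0
+ 
+ def combined_significance(score_s, w_s, score_b, w_b, boundaries):
+     # TabPFN scores lie in [0, 1]; np.histogram's last bin is right-inclusive.
+     edges = np.concatenate(([0.0], np.sort(boundaries), [1.0]))
+     s_yields, _ = np.histogram(score_s, bins=edges, weights=w_s)
+     b_yields, _ = np.histogram(score_b, bins=edges, weights=w_b)
+     return np.sqrt(sum(category_significance(s, b) ** 2
+                        for s, b in zip(s_yields, b_yields)))
+ ```

+ New boundaries would then be added greedily, stopping once the relative improvement in the combined significance falls below the 5% threshold described above.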
 
 
 
41
 
42
+ ## Architecture
 
 
 
43
 
44
+ We adopt a hybrid approach to automate the Higgs boson to diphoton data analysis. Given the relatively fixed workflow, we use `Snakemake` to orchestrate the sequential execution of analysis steps. A supervisor–coder agent is deployed to complete each step. This design balances the determinism of the analysis structure with the flexibility often required in HEP analyses.
 
 
45
 
46
+ ### Workflow management
 
47
 
48
+ `Snakemake` is a Python-based workflow management system that enables researchers to define computational workflows through rule-based specifications, where each rule describes how to generate specific output files from input files using defined scripts or commands. In this package, `Snakemake` serves as the orchestration backbone: it manages the dependencies and execution order of the five-stage analysis workflow for ATLAS diphoton data processing, as sketched below. This modular approach allows the complex physics analysis to be broken down into manageable, interdependent components that can be executed efficiently and reproducibly.
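
+ Schematically, driving the five rule files in order comes down to something like the following (a sketch only; the repository's runner scripts, documented in `SETUP.md`, wrap this with validation, logging, and output-directory handling):

+ ```python
+ import subprocess
+ 
+ # Sketch: run the five workflow stages in order through the snakemake CLI.
+ STEPS = [
+     "workflow/summarize_root.smk",
+     "workflow/create_numpy.smk",
+     "workflow/preprocess.smk",
+     "workflow/scores.smk",
+     "workflow/categorization.smk",
+ ]
+ for smk in STEPS:
+     subprocess.run(["snakemake", "--snakefile", smk, "--cores", "1"], check=True)
+ ```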
 
 
 
49
 
50
+ ### Supervisor–coder agent
 
 
 
51
 
52
+ We design a supervisor–coder agent to carry out each task, as illustrated in Fig. 1. The supervisor and coder are implemented as API calls to a large language model (LLM), with distinct system prompts tailored to their respective roles. The supervisor receives instructions from the human user, formulates corresponding directives for the coder, and reviews the coder's output. The coder, in turn, takes the supervisor's instructions and generates code to implement the required action. The generated code is executed through an execution engine, which records the execution trace.
53
 
54
+ The supervisor and coder roles are defined by their differing access to state, memory, and system instructions. In the reference configuration, both roles are implemented using the `gemini-pro-2.5` model [Google, 2025](#ref-google-2025); however, the same architecture is applied to a range of contemporary LLMs to evaluate model-dependent performance and variability.
55
 
56
+ <!-- ![Illustration of internal workflow for the supervisor–coder agent.](supervisor_coder.pdf) -->
57
 
58
+ Each agent interaction is executed through API calls to the LLM. Although we set the temperature to 0, other sampling parameters (e.g., top-p, top-k) remained at their default values, and some thinking-oriented models internally raise the effective temperature. As a result, the outputs exhibit minor stochastic variation even under identical inputs. Each call includes a user instruction, a system prompt, and auxiliary metadata for tracking errors and execution records.
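
+ Concretely, a single role-specific interaction reduces to an OpenAI-compatible chat-completion call of roughly the following form (a sketch: the endpoint URL, model alias, and prompt strings are placeholders, and the real framework additionally records tokens, errors, and execution metadata):

+ ```python
+ import os
+ from openai import OpenAI
+ 
+ client = OpenAI(
+     api_key=os.environ["CBORG_API_KEY"],
+     base_url="https://api.cborg.lbl.gov",  # placeholder endpoint
+ )
+ 
+ coder_system_prompt = "You are the coder agent. Return only a Python script."  # abbreviated placeholder
+ supervisor_instruction = "Write a script that converts the ROOT ntuples to numpy arrays."  # abbreviated placeholder
+ 
+ response = client.chat.completions.create(
+     model="google/gemini-pro",   # CBORG-style alias, illustrative
+     temperature=0,               # other sampling parameters stay at their defaults
+     messages=[
+         {"role": "system", "content": coder_system_prompt},
+         {"role": "user", "content": supervisor_instruction},
+     ],
+ )
+ generated_code = response.choices[0].message.content
+ ```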
59
 
60
+ For the initial user interaction with the supervisor, the input prompt includes a natural-language description of the task objectives, suggested implementation strategies, and a system prompt that constrains the behavior of the model output. The result of this interaction is an instruction passed to the coder, which in turn generates a Python script to be executed. If execution produces an error, the error message, the original supervisor instruction, the generated code, and a debugging system prompt are passed back to the supervisor in a follow-up API call. The supervisor then issues revised instructions to the coder to address the problem. This self-correction loop is repeated up to three times before the trial is deemed unsuccessful.
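
+ The control flow of a single trial can be summarized as follows (a schematic sketch: the helper callables and their signatures are illustrative, not the framework's actual API):

+ ```python
+ def run_step(task_prompt, ask_supervisor, ask_coder, run_script, max_iterations=3):
+     # Schematic control flow of one trial. ask_supervisor/ask_coder wrap the
+     # role-specific LLM calls; run_script executes the generated code and
+     # returns (ok, error_message). Names and signatures are illustrative.
+     instruction = ask_supervisor(task_prompt)
+     for _ in range(max_iterations):
+         code = ask_coder(instruction)
+         ok, error = run_script(code)
+         if ok:
+             return True
+         # On failure, hand the error message, the previous instruction, and the
+         # failing code back to the supervisor for a revised instruction.
+         instruction = ask_supervisor(task_prompt, previous=instruction,
+                                      code=code, error=error)
+     return False   # trial deemed unsuccessful after max_iterations attempts
+ ```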
61
 
62
+ ## Results
63
 
64
+ To establish a baseline, we conducted 219 experiments with the `gemini-pro-2.5` model, providing high statistical precision across all analysis stages. In this configuration, the workflow was organized into three composite steps:
65
 
66
+ 1. **Data preparation** including *ntuple creation* and *preprocessing*
67
+ 2. **Signal–background (S-B) separation**
68
+ 3. **Categorization** based on the expected Higgs-to-diphoton significance
69
 
70
+ Up to five self-correction iterations were applied. The corresponding success rates for these three steps were **58 ± 3%**, **88 ± 2%**, and **74 ± 3%**, respectively.
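
+ For orientation, the quoted uncertainties are consistent with simple binomial counting errors on 219 trials per step (a back-of-envelope check, not necessarily the uncertainty treatment used in the study):

+ ```python
+ import math
+ 
+ n = 219
+ for p in (0.58, 0.88, 0.74):
+     sigma = math.sqrt(p * (1.0 - p) / n)   # binomial standard error
+     print(f"{p:.0%} +/- {sigma:.0%}")      # -> 58% +/- 3%, 88% +/- 2%, 74% +/- 3%
+ ```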
71
 
72
+ As summarized in Table 1, the **data-preparation** stage (combining ntuple production and preprocessing) was the most error-prone, with 93 failures out of 219 trials. Most issues stemmed from insufficient context for identifying objects within ROOT files. The dominant failure modes were Type 1 (zero event weights) and Type 5 (missing intermediate files), indicating persistent challenges in reasoning about file structures, dependencies, and runtime organization. Providing richer execution context such as package lists, file hierarchies, and metadata could help mitigate these problems. Despite these limitations, the completion rate of this stage (~57%) demonstrates that LLMs can autonomously execute complex domain-specific programming tasks a non-trivial fraction of the time.
73
 
74
+ The subsequent **S-B separation** using `TabPFN` and the final **categorization** for the Higgs-to-diphoton measurement exhibited substantially fewer failures (25 and 57, respectively). These errors were primarily Type 3 (function-calling) and Type 6 (semantic) issues, reflecting improved stability once the workflow reaches the model-training and evaluation stages.
75
 
76
+ | **Step** | **Type 1** | **Type 2** | **Type 3** | **Type 4** | **Type 5** | **Type 6** | **Type 7** |
77
+ |---------------------------|:----------:|:----------:|:----------:|:----------:|:----------:|:----------:|:----------:|
78
+ | Data-preparation (93 failures) | 41 | 15 | 1 | 9 | 17 | 6 | 4 |
79
+ | S-B separation (25 failures) | 0 | 4 | 3 | 0 | 13 | 5 | 0 |
80
+ | Categorization (57 failures) | 0 | 4 | 26 | 0 | 6 | 17 | 4 |
81
 
82
+ *Table 1. Distribution of failure counts by error category for each workflow step, based on 219 trials per step.*
 
83
 
84
+ *Error type definitions:*
85
+ Type 1: all data weights = 0
86
+ Type 2: dummy data created
87
+ Type 3: function-calling error
88
+ Type 4: incorrect branch name
89
+ Type 5: intermediate file not found
90
+ Type 6: semantic error
91
+ Type 7: other
92
 
93
+ To assess agent efficiency, we measured the ratio of user-input tokens to the total tokens exchanged with the API. For the `gemini-pro-2.5` baseline, this ratio was **(1.65 ± 0.15) × 10⁻³** for **data preparation**, **(1.43 ± 0.10) × 10⁻³** for **S-B separation**, and **(0.93 ± 0.07) × 10⁻³** for **categorization**. A higher ratio indicates more efficient task execution, as fewer internally generated tokens are required to complete the same instruction set.
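
+ The metric itself is a simple fraction of the per-trial token accounting (the field names below are illustrative, not the framework's actual log schema):

+ ```python
+ def user_token_ratio(user_input_tokens, total_tokens):
+     # Fraction of all exchanged tokens that came directly from the user prompt.
+     return user_input_tokens / total_tokens
+ 
+ # e.g. a ratio of 1.65e-3 means only ~0.17% of the traffic is user input;
+ # the remainder is generated by the supervisor-coder loop itself.
+ print(user_token_ratio(1_650, 1_000_000))
+ ```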
94
 
95
+ Over 98% of tokens originated from the model's autonomous reasoning and self-correction rather than direct user input, suggesting that additional task detail would minimally affect overall token cost. This ratio thus provides a simple diagnostic of communication efficiency and reasoning compactness.
 
 
 
96
 
97
+ Following the initial benchmark with `gemini-pro-2.5`, we expanded the study to include additional models, such as `openai-gpt-5` [OpenAI, 2025](#ref-openai-2025), `claude-3.5` [Anthropic, 2024](#ref-anthropic-2024), `qwen-3` [Alibaba, 2025](#ref-alibaba-2025), and the open-weight `gpt-oss-120b` [OpenAI et al., 2025](#ref-openai-et-al-2025), evaluated under the same agentic workflow. Based on early observations, the prompts for the **data preparation** stage were refined and divided into three subtasks: **ROOT file inspection**, **ntuple conversion**, and **preprocessing** (signal and background region selection). Input file locations were also made explicit to the agent to ensure deterministic resolution of data paths and reduce reliance on implicit context.
 
 
98
 
99
+ <!-- Insert figure here. -->
100
 
101
+ The results across models, summarized in Figures 1 and 2, show consistent qualitative behavior with the baseline while highlighting quantitative differences in reliability, efficiency, and error patterns. For the `gemini-pro-2.5` model, the large number of repeated trials (219 total) provides a statistically robust characterization of performance across steps. For the other models, each tested with approximately ten trials per step, the smaller sample size limits statistical interpretation, and the results should therefore be regarded as qualitative indicators of behavior rather than precise performance estimates. Nonetheless, the observed cross-model consistency and similar failure patterns suggest that the workflow and evaluation metrics are sufficiently general to support larger-scale future benchmarks. This pilot-level comparison thus establishes both the feasibility and reproducibility of the agentic workflow across distinct LLM architectures.
 
 
102
 
103
+ Figure 1 summarizes the cross-model performance of the agentic workflow. The heatmap shows the success rate for each model–step pair, highlighting substantial variation in reliability across architectures. The dates appended to each model name denote the model release or update version used for testing, while the parameter in parentheses (e.g., "17B") indicates the model's approximate parameter count in billions. Models in the `GPT-5` [OpenAI, 2025](#ref-openai-2025) and `Gemini` [Google, 2025](#ref-google-2025) families achieve the highest completion fractions across most steps, whereas smaller or open-weight models such as `gpt-oss-120b` [OpenAI et al., 2025](#ref-openai-et-al-2025) exhibit lower but still non-negligible success on certain steps. This demonstrates that the workflow can be executed end-to-end by multiple LLMs, though with markedly different robustness and learning stability.
 
104
 
105
+ Figure 2 shows the distribution of failure modes across all analysis stages, considering only trials that did not reach a successful completion. The categorization is based on the LLM-generated outputs themselves, capturing how each model typically fails. Error types include function-calling issues, missing or placeholder data, infrastructure problems (e.g., API or file-handling failures), and semantic errors (cases where the model misinterprets the prompt and produces runnable but incorrect code), as well as syntax mistakes.
 
106
 
107
+ Clear model-dependent patterns emerge. Models with higher success rates, such as the `GPT-5` [OpenAI, 2025](#ref-openai-2025) and `Gemini 2.5` [Google, 2025](#ref-google-2025) families, show fewer logic and syntax issues, while smaller or open-weight models exhibit more semantic and data-handling failures. These differences reveal characteristic failure signatures that complement overall completion rates.
 
 
108
 
109
+ Taken together, Figures 1 and 2 highlight two main aspects of LLM-agent behavior: overall task reliability and the characteristic error modes behind failed executions. These results show that models differ not only in success rate but also in the nature of their failures, providing insight into their robustness and limitations in realistic HEP workflows.
110
 
111
+ ## Limitations
112
 
113
+ This study shows that LLMs can support HEP data analysis workflows by interpreting natural language, generating executable code, and applying basic self-correction. While multi-step task planning is beyond the current scope, the `Snakemake` integration provides a natural path toward rule-based agent planning. Future work will pursue this direction and further strengthen the framework through improvements in prompting, agent design, domain adaptation, and retrieval-augmented generation.
114
 
115
+ ## Conclusion
 
 
 
116
 
117
+ We demonstrated the feasibility of employing LLM agents to automate components of high-energy physics (HEP) data analysis within a reproducible workflow. The proposed hybrid framework combines LLM-based task execution with deterministic workflow management, enabling near end-to-end analyses with limited human oversight. While the `gemini-pro-2.5` model [Google, 2025](#ref-google-2025) served as a statistically stable reference, the broader cross-model evaluation shows that several contemporary LLMs can perform complex HEP tasks with distinct levels of reliability and robustness. Beyond this proof of concept, the framework provides a foundation for systematically assessing and improving LLM-agent performance in domain-specific scientific workflows.
118
 
119
  ---
120
 
121
+ ## References
122
+
123
+ - <span id="ref-atlas-collab-2020"></span> **ATLAS Collaboration.** *ATLAS simulated samples collection for jet reconstruction training, as part of the 2020 Open Data release.* CERN Open Data Portal (2020). [doi:10.7483/OPENDATA.ATLAS.L806.5CKU](http://doi.org/10.7483/OPENDATA.ATLAS.L806.5CKU)
124
+ - <span id="ref-cms-collab-2012"></span> **CMS Collaboration.** *Observation of a New Boson at a Mass of 125 GeV with the CMS Experiment at the LHC.* Phys. Lett. B 716, 30–61 (2012). [arXiv:1207.7235](https://arxiv.org/abs/1207.7235), [doi:10.1016/j.physletb.2012.08.021](https://doi.org/10.1016/j.physletb.2012.08.021)
125
+ - <span id="ref-antcheva-et-al-2009"></span> **Antcheva et al.** *ROOT: A C++ framework for petabyte data storage, statistical analysis and visualization.* Comput. Phys. Commun. 180, 2499–2512 (2009). [doi:10.1016/j.cpc.2009.08.005](https://doi.org/10.1016/j.cpc.2009.08.005)
126
+ - <span id="ref-hollmann-et-al-2025"></span> **Hollmann et al.** *Accurate predictions on small data with a tabular foundation model.* Nature 637, 319–326 (2025). [doi:10.1038/s41586-024-08328-6](https://doi.org/10.1038/s41586-024-08328-6)
127
+ - <span id="ref-xiao-et-al-2024"></span> **Xiao et al.** *CellAgent: An LLM-driven Multi-Agent Framework for Automated Single-cell Data Analysis.* arXiv:2407.09811 (2024). [https://arxiv.org/abs/2407.09811](https://arxiv.org/abs/2407.09811)
128
+ - <span id="ref-wang-duan-2025"></span> **Wang & Duan.** *Empirical Research on Utilizing LLM-based Agents for Automated Bug Fixing via LangGraph.* arXiv:2502.18465 (2025). [https://arxiv.org/abs/2502.18465](https://arxiv.org/abs/2502.18465)
129
+ - <span id="ref-richmond-et-al-2025"></span> **Richmond et al.** *FeynTune: Large Language Models for High-Energy Theory.* arXiv:2508.03716 (2025). [https://arxiv.org/abs/2508.03716](https://arxiv.org/abs/2508.03716)
130
+ - <span id="ref-simons-2024"></span> **Simons.** *Astro-HEP-BERT: A bidirectional language model for studying the meanings of concepts...* arXiv:2411.14877 (2024). [https://arxiv.org/abs/2411.14877](https://arxiv.org/abs/2411.14877)
131
+ - <span id="ref-wu-et-al-2024"></span> **Wu et al.** *Scaling Particle Collision Data Analysis.* arXiv:2412.00129 (2024). [https://arxiv.org/abs/2412.00129](https://arxiv.org/abs/2412.00129)
132
+ - <span id="ref-molder-et-al-2021"></span> **Mölder et al.** *Sustainable data analysis with Snakemake.* F1000Research 10, 33 (2021). [doi:10.12688/f1000research.29032.2](https://doi.org/10.12688/f1000research.29032.2)
133
+ - <span id="ref-chase-2022"></span> **Chase.** *LangChain: A Framework for Large Language Model Applications.* (2022). [https://langchain.com/](https://langchain.com/)
134
+ - <span id="ref-langchain-team-2025"></span> **LangChain Team.** *LangGraph: A Graph-Based Agent Workflow Framework.* (2025). [https://langchain-ai.github.io/langgraph/](https://langchain-ai.github.io/langgraph/)
135
+ - <span id="ref-google-2025"></span> **Google.** Gemini 2.5 Pro. (2025). [https://cloud.google.com/vertex-ai/generative-ai/docs/models/gemini/2-5-pro](https://cloud.google.com/vertex-ai/generative-ai/docs/models/gemini/2-5-pro)
136
+ - <span id="ref-openai-2025"></span> **OpenAI.** GPT-5. (2025). [https://platform.openai.com/docs/models/gpt-5](https://platform.openai.com/docs/models/gpt-5)
137
+ - <span id="ref-anthropic-2024"></span> **Anthropic.** Claude 3.5 Sonnet (2024). [https://www.anthropic.com/news/claude-3-5-sonnet](https://www.anthropic.com/news/claude-3-5-sonnet)
138
+ - <span id="ref-alibaba-2025"></span> **Alibaba.** Qwen3. (2025). [https://qwenlm.github.io/blog/qwen3/](https://qwenlm.github.io/blog/qwen3/)
139
+ - <span id="ref-openai-et-al-2025"></span> **OpenAI et al.** gpt-oss-120b & gpt-oss-20b Model Card (2025). [arXiv:2508.10925](https://arxiv.org/abs/2508.10925)
 
SETUP.md ADDED
@@ -0,0 +1,448 @@
1
+ # Large Language Model Analysis Framework for High Energy Physics
2
+
3
+ A framework for testing and evaluating Large Language Models (LLMs) on ATLAS H→γγ analysis tasks using a supervisor-coder architecture.
4
+
5
+ ## Table of Contents
6
+ - [Setup](#setup)
7
+ - [Data and Solution](#data-and-solution)
8
+ - [Running Tests](#running-tests)
9
+ - [Analysis and Visualization](#analysis-and-visualization)
10
+ - [Project Structure](#project-structure)
11
+ - [Advanced Usage](#advanced-usage)
12
+
13
+ ---
14
+
15
+ ## Setup
16
+
17
+ ### Prerequisites
18
+
19
+ **CBORG API Access Required**
20
+
21
+ This framework uses Lawrence Berkeley National Laboratory's CBORG API to access various LLMs. To use this code, you will need:
22
+
23
+ 1. Access to the CBORG API (contact LBL for access)
24
+ 2. A CBORG API key
25
+ 3. Network access to the CBORG API endpoint
26
+
27
+ **Note for External Users:** CBORG is an internal LBL system. External users may need to:
28
+ - Request guest access through LBL collaborations
29
+ - Adapt the code to use OpenAI API directly (requires code modifications)
30
+ - Contact the repository maintainers for alternative deployment options
31
+
32
+ ### Environment Setup
33
+ Create Conda environment:
34
+ ```bash
35
+ mamba env create -f environment.yml
36
+ conda activate llm_env
37
+ ```
38
+
39
+ ### API Configuration
40
+ Create script `~/.apikeys.sh` to export CBORG API key:
41
+ ```bash
42
+ export CBORG_API_KEY="INSERT_API_KEY"
43
+ ```
44
+
45
+ Then source it before running tests:
46
+ ```bash
47
+ source ~/.apikeys.sh
48
+ ```
49
+
50
+ ### Initial Configuration
51
+
52
+ Before running tests, set up your configuration files:
53
+
54
+ ```bash
55
+ # Copy example configuration files
56
+ cp config.example.yml config.yml
57
+ cp models.example.txt models.txt
58
+
59
+ # Edit config.yml to set your preferred models and parameters
60
+ # Edit models.txt to list models you want to test
61
+ ```
62
+
63
+ **Important:** The `models.txt` file must end with a blank line.
64
+
65
+ ---
66
+
67
+ ## Data and Solution
68
+
69
+ ### ATLAS Open Data Samples
70
+ All four data samples and Monte Carlo Higgs→γγ samples (including ttH) from the 2020 ATLAS Open Data diphoton campaign are available at:
71
+ ```
72
+ /global/cfs/projectdirs/atlas/eligd/llm_for_analysis_copy/data/
73
+ ```
74
+
75
+ **Important:** If copying data elsewhere, make the directory read-only to prevent LLM-generated code from modifying files:
76
+ ```bash
77
+ chmod -R a-w /path/to/data/directory
78
+ ```
79
+
80
+ ### Reference Solution
81
+ - Navigate to `solution/` directory and run `python soln.py`
82
+ - Use flags: `--step1`, `--step2`, `--step3`, `--plot` to control execution
83
+
84
+ ### Reference Arrays for Validation
85
+ Large `.npy` reference arrays are not committed to Git (see `.gitignore`).
86
+
87
+ **Quick fetch from repo root:**
88
+ ```bash
89
+ bash scripts/fetch_solution_arrays.sh
90
+ ```
91
+
92
+ **Or copy from NERSC shared path:**
93
+ ```
94
+ /global/cfs/projectdirs/atlas/dwkim/llm_test_dev_cxyang/llm_for_analysis/solution/arrays
95
+ ```
96
+
97
+ ---
98
+
99
+ ## Running Tests
100
+
101
+ ### Model Configuration
102
+
103
+ Three model list files control testing:
104
+ - **`models.txt`**: Models for sequential testing
105
+ - **`models_supervisor.txt`**: Supervisor models for paired testing
106
+ - **`models_coder.txt`**: Coder models for paired testing
107
+
108
+ **Important formatting rules:**
109
+ - One model per line
110
+ - File must end with a blank line
111
+ - Repeat model names for multiple trials
112
+ - Use CBORG aliases (e.g., `anthropic/claude-sonnet:latest`)
113
+
114
+ See `CBORG_MODEL_MAPPINGS.md` for available models and their actual versions.
115
+
116
+ ### Testing Workflows
117
+
118
+ #### 1. Sequential Testing (Single Model at a Time)
119
+ ```bash
120
+ bash test_models.sh output_dir_name
121
+ ```
122
+ Tests all models in `models.txt` sequentially.
123
+
124
+ #### 2. Parallel Testing (Multiple Models)
125
+ ```bash
126
+ # Basic parallel execution
127
+ bash test_models_parallel.sh output_dir_name
128
+
129
+ # GNU Parallel (recommended for large-scale testing)
130
+ bash test_models_parallel_gnu.sh output_dir_name [max_models] [tasks_per_model]
131
+
132
+ # Examples:
133
+ bash test_models_parallel_gnu.sh experiment1 # Default: 5 models, 5 tasks each
134
+ bash test_models_parallel_gnu.sh test 3 5 # 3 models, 5 tasks per model
135
+ bash test_models_parallel_gnu.sh large_test 10 5 # 10 models, 5 tasks each
136
+ ```
137
+
138
+ **GNU Parallel features:**
139
+ - Scales to 20-30 models with 200-300 total parallel jobs
140
+ - Automatic resource management
141
+ - Fast I/O using `/dev/shm` temporary workspace
142
+ - Comprehensive error handling and logging
143
+
144
+ #### 3. Step-by-Step Testing with Validation
145
+ ```bash
146
+ # Run all 5 steps with validation
147
+ ./run_smk_sequential.sh --validate
148
+
149
+ # Run specific steps
150
+ ./run_smk_sequential.sh --step2 --step3 --validate --job-id 002
151
+
152
+ # Run individual steps
153
+ ./run_smk_sequential.sh --step1 --validate # Step 1: Summarize ROOT
154
+ ./run_smk_sequential.sh --step2 --validate # Step 2: Create NumPy arrays
155
+ ./run_smk_sequential.sh --step3 --validate # Step 3: Preprocess
156
+ ./run_smk_sequential.sh --step4 --validate # Step 4: Compute scores
157
+ ./run_smk_sequential.sh --step5 --validate # Step 5: Categorization
158
+
159
+ # Custom output directory
160
+ ./run_smk_sequential.sh --step1 --validate --auto-dir # Creates timestamped dir
161
+ ```
162
+
163
+ **Directory naming options:**
164
+ - `--job-id ID`: Creates `results_job_ID/`
165
+ - `--auto-dir`: Creates `results_YYYYMMDD_HHMMSS/`
166
+ - `--out-dir DIR`: Custom directory name
167
+
168
+ ### Validation
169
+
170
+ **Automatic validation (during execution):**
171
+ ```bash
172
+ ./run_smk_sequential.sh --step1 --step2 --validate
173
+ ```
174
+ Validation logs saved to `{output_dir}/logs/*_validation.log`
175
+
176
+ **Manual validation (after execution):**
177
+ ```bash
178
+ # Validate all steps
179
+ python check_soln.py --out_dir results_job_002
180
+
181
+ # Validate specific step
182
+ python check_soln.py --out_dir results_job_002 --step 2
183
+ ```
184
+
185
+ **Validation features:**
186
+ - ✅ Adaptive tolerance with 4-significant-digit precision (see the sketch below)
187
+ - 📊 Column-by-column difference analysis
188
+ - 📋 Side-by-side value comparison
189
+ - 🎯 Clear, actionable error messages
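
+ The adaptive-tolerance idea can be illustrated as follows (a sketch only, not the actual `check_soln.py` implementation):

+ ```python
+ import numpy as np
+ 
+ def matches_to_four_sig_figs(generated, reference):
+     # A relative tolerance of 5e-4 corresponds to agreement in roughly the
+     # first four significant figures of each value.
+     return np.allclose(generated, reference, rtol=5e-4, atol=0.0)
+ ```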
190
+
191
+ ### Speed Optimization
192
+
193
+ Reduce iteration counts in `config.yml`:
194
+ ```yaml
195
+ # Limit LLM coder attempts (default 10)
196
+ max_iterations: 3
197
+ ```
198
+
199
+ ---
200
+
201
+ ## Analysis and Visualization
202
+
203
+ ### Results Summary
204
+ All test results are aggregated in:
205
+ ```
206
+ results_summary.csv
207
+ ```
208
+
209
+ **Columns include:** supervisor, coder, step, success, iterations, duration, API_calls, tokens, errors, error_descriptions
210
+
211
+ ### Error Analysis and Categorization
212
+
213
+ **Automated error analysis:**
214
+ ```bash
215
+ python error_analysis.py --results_dirs <dir1> <dir2> ... --output results_summary.csv --model <model_name>
216
+ ```
217
+
218
+ Uses LLM to analyze comprehensive logs and categorize errors into:
219
+ - Semantic errors
220
+ - Function-calling errors
221
+ - Intermediate file not found
222
+ - Incorrect branch name
223
+ - OpenAI API errors
224
+ - Data quality issues (all weights = 0)
225
+ - Other/uncategorized
226
+
227
+ ### Interactive Analysis Notebooks
228
+
229
+ #### 1. Five-Step Performance Analysis (`five_step_analysis.ipynb`)
230
+ Comprehensive analysis of model performance across all 5 workflow steps:
231
+ - **Success rate heatmap** (models × steps)
232
+ - **Agent work progression** (iterations over steps)
233
+ - **API call statistics** (by step and model)
234
+ - **Cost analysis** (input/output tokens, estimated pricing)
235
+
236
+ **Output plots:**
237
+ - `plots/1_success_rate_heatmap.pdf`
238
+ - `plots/2_agent_work_line_plot.pdf`
239
+ - `plots/3_api_calls_line_plot.pdf`
240
+ - `plots/4_cost_per_step.pdf`
241
+ - `plots/five_step_summary_stats.csv`
242
+
243
+ #### 2. Error Category Analysis (`error_analysis.ipynb`)
244
+ Deep dive into error patterns and failure modes:
245
+ - **Normalized error distribution** (stacked bar chart with percentages)
246
+ - **Error type heatmap** (models × error categories)
247
+ - **Top model breakdowns** (faceted plots for top 9 models)
248
+ - **Error trends across steps** (stacked area chart)
249
+
250
+ **Output plots:**
251
+ - `plots/error_distribution_by_model.pdf`
252
+ - `plots/error_heatmap_by_model.pdf`
253
+ - `plots/error_categories_top_models.pdf`
254
+ - `plots/errors_by_step.pdf`
255
+
256
+ #### 3. Quick Statistics (`plot_stats.ipynb`)
257
+ Legacy notebook for basic statistics visualization.
258
+
259
+ ### Log Interpretation
260
+
261
+ **Automated log analysis:**
262
+ ```bash
263
+ python logs_interpreter.py --log_dir <output_dir> --model lbl/cborg-deepthought --output analysis.txt
264
+ ```
265
+
266
+ Analyzes comprehensive supervisor-coder logs to identify:
267
+ - Root causes of failures
268
+ - Responsible parties (user, supervisor, coder, external)
269
+ - Error patterns across iterations
270
+
271
+ ---
272
+
273
+ ## Project Structure
274
+
275
+ ### Core Scripts
276
+ - **`supervisor_coder.py`**: Supervisor-coder framework implementation
277
+ - **`check_soln.py`**: Solution validation with enhanced comparison
278
+ - **`write_prompt.py`**: Prompt management and context chaining
279
+ - **`update_stats.py`**: Statistics tracking and CSV updates
280
+ - **`error_analysis.py`**: LLM-powered error categorization
281
+
282
+ ### Test Runners
283
+ - **`test_models.sh`**: Sequential model testing
284
+ - **`test_models_parallel.sh`**: Parallel testing (basic)
285
+ - **`test_models_parallel_gnu.sh`**: GNU Parallel testing (recommended)
286
+ - **`test_stats.sh`**: Individual model statistics
287
+ - **`test_stats_parallel.sh`**: Parallel step execution
288
+ - **`run_smk_sequential.sh`**: Step-by-step workflow runner
289
+
290
+ ### Snakemake Workflows (`workflow/`)
291
+ The analysis workflow is divided into 5 sequential steps:
292
+
293
+ 1. **`summarize_root.smk`**: Extract ROOT file structure and branch information
294
+ 2. **`create_numpy.smk`**: Convert ROOT → NumPy arrays
295
+ 3. **`preprocess.smk`**: Apply preprocessing transformations
296
+ 4. **`scores.smk`**: Compute signal/background classification scores
297
+ 5. **`categorization.smk`**: Final categorization and statistical analysis
298
+
299
+ **Note:** Later steps use solution outputs to enable testing even when earlier steps fail.
300
+
301
+ ### Prompts (`prompts/`)
302
+ - `summarize_root.txt`: Step 1 task description
303
+ - `create_numpy.txt`: Step 2 task description
304
+ - `preprocess.txt`: Step 3 task description
305
+ - `scores.txt`: Step 4 task description
306
+ - `categorization.txt`: Step 5 task description
307
+ - `supervisor_first_call.txt`: Initial supervisor instructions
308
+ - `supervisor_call.txt`: Subsequent supervisor instructions
309
+
310
+ ### Utility Scripts (`util/`)
311
+ - **`inspect_root.py`**: ROOT file inspection tools
312
+ - **`analyze_particles.py`**: Particle-level analysis
313
+ - **`compare_arrays.py`**: NumPy array comparison utilities
314
+
315
+ ### Model Documentation
316
+ - **`CBORG_MODEL_MAPPINGS.md`**: CBORG alias → actual model mappings
317
+ - **`COMPLETE_MODEL_VERSIONS.md`**: Full version information for all tested models
318
+ - **`MODEL_NAME_UPDATES.md`**: Model name standardization notes
319
+ - **`O3_MODEL_COMPARISON.md`**: OpenAI O3 model variant comparison
320
+
321
+ ### Analysis Notebooks
322
+ - **`five_step_analysis.ipynb`**: Comprehensive 5-step performance analysis
323
+ - **`error_analysis.ipynb`**: Error categorization and pattern analysis
324
+ - **`error_analysis_plotting.ipynb`**: Additional error visualizations
325
+ - **`plot_stats.ipynb`**: Legacy statistics plots
326
+
327
+ ### Output Structure
328
+ Each test run creates:
329
+ ```
330
+ output_name/
331
+ ├── model_timestamp/
332
+ │   ├── generated_code/   # LLM-generated Python scripts
333
+ │   ├── logs/             # Execution logs and supervisor records
334
+ │   ├── arrays/           # NumPy arrays produced by generated code
335
+ │   ├── plots/            # Comparison plots (generated vs. solution)
336
+ │   ├── prompt_pairs/     # User + supervisor prompts
337
+ │   ├── results/          # Temporary ROOT files (job-scoped)
338
+ │   └── snakemake_log/    # Snakemake execution logs
339
+ ```
340
+
341
+ **Job-scoped ROOT outputs:**
342
+ - Step 5 uses temporary ROOT files (`signal.root`, `bkgd.root`)
343
+ - Written to `${OUTPUT_DIR}/results/` to prevent cross-run interference
344
+ - Automatically cleaned after significance calculation
345
+
346
+ ---
347
+
348
+ ## Advanced Usage
349
+
350
+ ### Supervisor-Coder Configuration
351
+
352
+ Control iteration limits in `config.yml`:
353
+ ```yaml
354
+ model: 'anthropic/claude-sonnet:latest'
355
+ name: 'experiment_name'
356
+ out_dir: 'results/experiment_name'
357
+ max_iterations: 10 # Maximum supervisor-coder iterations per step
358
+ ```
359
+
360
+ ### Parallel Execution Tuning
361
+
362
+ For `test_models_parallel_gnu.sh`:
363
+ ```bash
364
+ # Syntax:
365
+ bash test_models_parallel_gnu.sh <output> <max_models> <tasks_per_model>
366
+
367
+ # Conservative (safe for shared systems):
368
+ bash test_models_parallel_gnu.sh test 3 5 # 15 total jobs
369
+
370
+ # Aggressive (dedicated nodes):
371
+ bash test_models_parallel_gnu.sh test 10 10 # 100 total jobs
372
+ ```
373
+
374
+ ### Custom Validation
375
+
376
+ Run validation on specific steps or with custom tolerances:
377
+ ```bash
378
+ # Validate only data conversion step
379
+ python check_soln.py --out_dir results/ --step 2
380
+
381
+ # Check multiple specific steps
382
+ python check_soln.py --out_dir results/ --step 2 --step 3 --step 4
383
+ ```
384
+
385
+ ### Log Analysis Pipeline
386
+
387
+ ```bash
388
+ # 1. Run tests
389
+ bash test_models_parallel_gnu.sh experiment1 5 5
390
+
391
+ # 2. Analyze logs with LLM
392
+ python logs_interpreter.py --log_dir experiment1/model_timestamp/ --output analysis.txt
393
+
394
+ # 3. Categorize errors
395
+ python error_analysis.py --results_dirs experiment1/*/ --output summary.csv
396
+
397
+ # 4. Generate visualizations
398
+ jupyter notebook error_analysis.ipynb
399
+ ```
400
+
401
+ ---
402
+
403
+ ## Roadmap and Future Directions
404
+
405
+ ### Planned Improvements
406
+
407
+ **Prompt Engineering:**
408
+ - Auto-load context (file lists, logs) at step start
409
+ - Provide comprehensive inputs/outputs/summaries upfront
410
+ - Develop prompt-management layer for cross-analysis reuse
411
+
412
+ **Validation & Monitoring:**
413
+ - Embed validation in workflows for immediate error detection
414
+ - Record input/output and state transitions for reproducibility
415
+ - Enhanced situation awareness through comprehensive logging
416
+
417
+ **Multi-Analysis Extension:**
418
+ - Rerun H→γγ with improved system prompts
419
+ - Extend to H→4ℓ and other Higgs+X channels
420
+ - Provide learned materials from previous analyses as reference
421
+
422
+ **Self-Improvement:**
423
+ - Reinforcement learning–style feedback loops
424
+ - Agent-driven prompt refinement
425
+ - Automatic generalization across HEP analyses
426
+
427
+ ---
428
+
429
+ ## Citation and Acknowledgments
430
+
431
+ This framework tests LLM agents on ATLAS Open Data from:
432
+ - 2020 ATLAS Open Data diphoton samples: https://opendata.cern.ch/record/15006
433
+
434
+ Models tested via CBORG API (Lawrence Berkeley National Laboratory).
435
+
436
+ ---
437
+
438
+ ## Support and Contributing
439
+
440
+ For questions or issues:
441
+ 1. Check existing documentation in `*.md` files
442
+ 2. Review example configurations in `config.yml`
443
+ 3. Examine validation logs in output directories
444
+
445
+ For contributions, please ensure:
446
+ - Model lists end with blank lines
447
+ - Prompts follow established format
448
+ - Validation passes for all test cases