---
license: mit
language:
- en
base_model:
- Salesforce/codet5-large
tags:
- ARC-AGI
- ARC
- code
datasets:
- mindware/arc-mega
- Open-Orca/SlimOrca
- camel-ai/math
- skeskinen/TinyStories-GPT4
- rajpurkar/squad_v2
- garage-bAInd/Open-Platypus
- Sharathhebbar24/arxiv-math-instruct-50k
- AlgorithmicResearchGroup/arxiv-physics-instruct-tune-30k
- TIGER-Lab/MathInstruct
- neoneye/histogram-comparisons-small-v1
- ise-uiuc/Magicoder-Evol-Instruct-110K
- PrimeIntellect/INTELLECT-MATH-SFT-Data
- PrimeIntellect/verifiable-math-problems
- sethapun/arithmetic_2md_1to1000
- EleutherAI/proof-pile-2
- MMInstruction/M3IT
- stingning/ultrachat
- timdettmers/openassistant-guanaco
- Dahoas/instruct-synthetic-prompt-responses
- pankajmathur/WizardLM_Orca
---

This checkpoint is the primary CodeT5-based solver we used for the MindsAI @ Tufa Labs entry in the ARC Prize 2025 competition. It shares the same architecture as `mindware/arc-codet5-660m-scr` (a 16-layer decoder variant of `Salesforce/codet5-large`), but *does not* include the Span-Corruption Refinement (SCR) auxiliary training stage. Instead, it is the best non-refinement checkpoint obtained during long-horizon pretraining on TPU-v4 systems.

- **No SCR stage**: this model was trained purely with the original curriculum of span corruption, instruction fine-tuning, and ARC fine-tuning.
- **Decoder-only pruning**: the original decoder depth (24) was reduced to 16 layers after experiments showed that encoder pruning harmed sample efficiency, while the loss from decoder pruning could be recovered through extended training.
- **Long-run TPU training**: training spanned roughly two years on a V4-64 TPU, made possible by Google’s TPU Research Cloud program.

📚 **ARC-Related Datasets & Frameworks**

- [RE-ARC](https://github.com/michaelhodel/re-arc): procedurally generates examples for the 400 ARC training tasks (we also include RE-ARC eval + ARC 1.5).
- [ConceptARC](https://github.com/victorvikram/ConceptARC)
- [1D-ARC](https://khalil-research.github.io/LLM4ARC/)
- ARC_gym, Sort-of-ARC
- Andreas Koepf’s generator suites (includes RE-ARC-style grids, code generation targets, and solution graphs).
- Jack Cole’s custom generators covering ~70 tasks plus larger concept sets (cellular automata, math-derived boards, etc.).

Several auxiliary datasets predict task metadata (graphs, heuristics, explanations) rather than final boards; they are part of the broader instruction mixture this model saw during pretraining.

## ARC Data Formatting

- ARC tasks ship as JSON where each `task_id` contains `train` pairs and `test` inputs; every grid is a rectangular list of lists with integers `0-9`. Dimensions follow the original 1×1–30×30 spec, though the evaluator accepts up to 50×50.
- Example task payload:
  ```json
  {
    "task_id": {
      "train": [
        {"input": [[0,0],[1,1]], "output": [[1,1],[1,1]]}
      ],
      "test": [
        {"input": [[0,0,0],[0,1,0],[0,0,0]]}
      ]
    }
  }
  ```
- Model prompts (`prompt` column during training/TTT/inference) are serialized text strings: `solve: train input1 output1 . … test tinput1 toutput1 `. Each grid token `` / `` / `` is produced by `grid_to_string`, so rows are concatenated digits separated by spaces. Multiple train examples increment the index (`input2`, `output2`, etc.).
- Prompt example:
  ```text
  solve: train input1 000 010 000 output1 11 3 3 10 111 101 111. input2 00 02 output2 5 2 2 20 22 20. test tinput1 0000 0300 0000 0000 toutput1 
  ```
- Model targets (`correct_answer` column and expected decoder output before post-processing) follow `output_prefix` semantics: ` {total_chars} {height} {width} {symbols} {row_strings}.` Here `total_chars = height*width + (height - 1)` and `symbols` is the deduplicated sequence of colors as they are first encountered when scanning the board row-major; that rule applies to every output grid we emit (training outputs inside the prompt and the predicted test toutput).
  Example target string for a 3×3 donut:
  ```text
  11 3 3 10 111 101 111.
  ```
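
The serialization rules above can be sketched in Python. This is a minimal illustration inferred from the examples in this card, not the project's actual code. `grid_to_string` is named in the text, but its body here is an assumption; `first_seen_symbols`, `encode_output`, `decode_output`, and `build_prompt` are hypothetical helper names. The exact whitespace of the prompt is likewise assumed from the prompt example.

```python
from typing import List, Tuple

Grid = List[List[int]]


def grid_to_string(grid: Grid) -> str:
    """Assumed behavior: join each row's digits, separate rows with spaces."""
    return " ".join("".join(str(cell) for cell in row) for row in grid)


def first_seen_symbols(grid: Grid) -> str:
    """Colors in the order they first appear in a row-major scan."""
    seen: List[int] = []
    for row in grid:
        for cell in row:
            if cell not in seen:
                seen.append(cell)
    return "".join(str(color) for color in seen)


def encode_output(grid: Grid) -> str:
    """Hypothetical helper: '{total_chars} {height} {width} {symbols} {rows}.'"""
    height, width = len(grid), len(grid[0])
    total_chars = height * width + (height - 1)  # digits plus row separators
    symbols = first_seen_symbols(grid)
    return f"{total_chars} {height} {width} {symbols} {grid_to_string(grid)}."


def decode_output(text: str) -> Grid:
    """Hypothetical helper: parse a target string and check its header."""
    fields = text.strip().rstrip(".").split()
    if len(fields) < 5:
        raise ValueError("expected four header fields plus at least one row")
    total_chars, height, width = (int(f) for f in fields[:3])
    symbols, rows = fields[3], fields[4:]
    grid = [[int(ch) for ch in row] for row in rows]
    if len(rows) != height or any(len(row) != width for row in rows):
        raise ValueError("row count or row width disagrees with the header")
    if len(" ".join(rows)) != total_chars:
        raise ValueError("total_chars disagrees with the serialized rows")
    if first_seen_symbols(grid) != symbols:
        raise ValueError("symbols field disagrees with the grid contents")
    return grid


def build_prompt(train_pairs: List[Tuple[Grid, Grid]], test_input: Grid) -> str:
    """Hypothetical helper: assemble a prompt shaped like the example above."""
    parts = ["solve: train"]
    for index, (inp, out) in enumerate(train_pairs, start=1):
        parts.append(
            f"input{index} {grid_to_string(inp)} output{index} {encode_output(out)}"
        )
    # The prompt ends with the 'toutput1' marker; the model continues from there.
    parts.append(f"test tinput1 {grid_to_string(test_input)} toutput1 ")
    return " ".join(parts)


if __name__ == "__main__":
    donut = [[1, 1, 1], [1, 0, 1], [1, 1, 1]]
    print(encode_output(donut))  # 11 3 3 10 111 101 111.
    print(decode_output("5 2 2 20 22 20."))  # [[2, 2], [2, 0]]
```

Because the header repeats information that is also implied by the rows, a parser like `decode_output` can reject generations whose declared size, character count, or symbol order does not match the emitted grid. Whether the real post-processing performs these checks is not stated here.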