IntelliTeX: Natural Language → LaTeX (Experimental)
Model summary
IntelliTeX is an experimental Small Language Model (SLM) study for converting spoken-style English math descriptions into a single LaTeX equation. It is intended as a research artifact (training regimes, decoding constraints, stress tests), not a production-ready LaTeX authoring system.
- Base model: Salesforce/codet5p-220m (CodeT5+ 220M)
- Primary task: text → LaTeX equation generation (single equation output)
- Primary language: English
What the model is for
Intended use
- Drafting LaTeX equations from short natural-language descriptions
- Prototyping or benchmarking compact models on domain-specific translation
Not recommended
- Fully automated formula generation without human or programmatic verification (a minimal pre-check sketch follows this list)
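Because generated LaTeX should always be checked before use, a lightweight pre-check can reject obviously malformed outputs before they reach a reviewer. The sketch below is a minimal heuristic validator, not part of the released tooling: it only verifies delimiter balance, not compilability or mathematical correctness.

```python
import re

def looks_well_formed(latex: str) -> bool:
    """Heuristic sanity check for a generated equation.

    Catches unbalanced braces and mismatched \\left/\\right pairs;
    it does NOT check that the formula compiles or is mathematically correct.
    """
    body = latex.strip().strip("$")  # tolerate $$...$$ wrappers
    if body.count("{") != body.count("}"):
        return False
    if len(re.findall(r"\\left\b", body)) != len(re.findall(r"\\right\b", body)):
        return False
    return bool(body)  # reject empty output

print(looks_well_formed(r"$$\int_{0}^{1}x^{2}\,dx$$"))  # True
print(looks_well_formed(r"$$\frac{a}{b$$"))             # False (unbalanced brace)
```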
Training approach (experimental study)
We evaluated multiple training configurations to understand what improves a compact model most:
- LoRA fine-tuning: rapid iteration and capability checks
- Full-parameter fine-tuning (FPFT): to measure the performance ceiling (LoRA often underperformed FPFT)
- Two-stage pipeline (continued pretraining → FPFT) inspired by CodeT5+ training recipes (a toy span-corruption sketch follows this list):
- Stage 1: domain-adaptive continued pretraining on TeXTeller with span-denoising + causal LM objectives
- ~4B tokens, ~76k steps
- Stage 2: supervised FPFT on Speech2LaTeX textβLaTeX pairs
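For readers unfamiliar with the span-denoising objective used in Stage 1, the toy sketch below shows the general T5-style recipe: random spans of the LaTeX token stream are replaced by sentinel tokens in the encoder input, and the decoder learns to reproduce the masked spans. The corruption rate, span-length distribution, and sentinel handling here are illustrative defaults, not the exact Stage 1 hyperparameters.

```python
import random

def span_corrupt(tokens, corrupt_rate=0.15, mean_span_len=3, seed=0):
    """Toy T5-style span corruption over a token list.

    Returns (source, target): masked spans become <extra_id_i> sentinels
    in the source, and the target lists each sentinel followed by the
    tokens it hides.
    """
    rng = random.Random(seed)
    n_to_mask = max(1, int(len(tokens) * corrupt_rate))
    masked = set()
    while len(masked) < n_to_mask:
        start = rng.randrange(len(tokens))
        length = max(1, round(rng.expovariate(1 / mean_span_len)))
        masked.update(range(start, min(start + length, len(tokens))))

    source, target, sentinel = [], [], 0
    i = 0
    while i < len(tokens):
        if i in masked:
            source.append(f"<extra_id_{sentinel}>")
            target.append(f"<extra_id_{sentinel}>")
            while i < len(tokens) and i in masked:
                target.append(tokens[i])
                i += 1
            sentinel += 1
        else:
            source.append(tokens[i])
            i += 1
    return " ".join(source), " ".join(target)

src, tgt = span_corrupt(r"\int _ { 0 } ^ { 1 } x ^ { 2 } \, d x".split())
print(src)  # LaTeX tokens with random spans replaced by <extra_id_i> sentinels
print(tgt)  # the sentinels followed by the tokens they hide
```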
Experiment Results
- Full-parameter fine-tuning (FPFT) was the largest single driver of gains in our experiments. In our report, FPFT CodeT5+ 220M reached EM 0.467 (0.463 with Stage 1 pretraining), roughly 4× the zero-shot Qwen2.5-Coder 32B Instruct baseline (EM 0.121) under the same evaluation setup.
- On the main Speech2LaTeX (S2L) benchmark, FPFT CodeT5+ 220M outperformed the larger FPFT Qwen2.5-Coder 0.5B baseline (EM 0.467 vs 0.405), indicating that training regime and architecture can matter more than parameter count for this task.
- Stage 1 (domain-adaptive continued pretraining) primarily improved robustness rather than average-case performance: it did not materially change EM on the main S2L test set (e.g., 0.467 vs 0.463), but helped more on stress conditions.
- On MathBridge stress tests, CodeT5+ 220M with Stage 1 + FPFT closely matched a much larger 3B comparator on long-context and long-target subsets, and outperformed the model with only FPFT.
Main benchmark (Speech2LaTeX test set)
- Qwen2.5-Coder 32B (base, no fine-tuning): EM 0.121
- FPFT Qwen2.5-Coder 0.5B: EM 0.405
- Stage 1 + FPFT CodeT5+ 220M: EM 0.463
- FPFT CodeT5+ 220M: EM 0.467
- FPFT Qwen2.5-Coder 3B: EM 0.507
Stress tests (MathBridge subsets)
- Long-context inputs (source length > 115 chars):
- FPFT CodeT5+ 220M: EM 0.150
- Stage 1 + FPFT CodeT5+ 220M: EM 0.195
- FPFT Qwen2.5-Coder 3B: EM 0.209
- Long-target outputs (target length > 60 chars):
- FPFT CodeT5+ 220M: EM 0.049
- FPFT Qwen2.5-Coder 3B: EM 0.070
- Stage 1 + FPFT CodeT5+ 220M: EM 0.076
Usage
```python
from transformers import AutoTokenizer, AutoModelForSeq2SeqLM

model_id = "duanxianpi/IntelliTeX"
tok = AutoTokenizer.from_pretrained(model_id)
model = AutoModelForSeq2SeqLM.from_pretrained(model_id)

# Spoken-style description of the target equation
text = "the integral from zero to one of x squared dx"
prompt = f"Convert natural-language math into a STRICT LaTeX equation\n{text}"

inputs = tok(prompt, return_tensors="pt")
out = model.generate(
    **inputs,
    max_length=512,
)
print(tok.decode(out[0], skip_special_tokens=True))
# Output: $$\int_{0}^{1}x^{2}\,dx$$
```
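The snippet above uses default greedy decoding. Beam search sometimes recovers better-formed equations at a modest latency cost; the settings below (num_beams=4, early_stopping) are illustrative and are not necessarily the configuration behind the reported benchmark numbers. It continues from the variables defined in the snippet above.

```python
out = model.generate(
    **inputs,
    max_length=512,
    num_beams=4,          # small beam; illustrative, not the evaluated setting
    early_stopping=True,  # stop once all beams have finished
)
print(tok.decode(out[0], skip_special_tokens=True))
```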
Running on Transformers.js
A live, in-browser demo built with the Transformers.js library showcases this efficiency advantage on typical CPU hardware.
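Transformers.js runs ONNX weights in the browser, so the checkpoint has to be converted before it can be loaded client-side. The sketch below is a minimal export using Hugging Face Optimum's standard optimum-onnxruntime path; the hosted demo may use its own conversion and quantization settings.

```python
from optimum.onnxruntime import ORTModelForSeq2SeqLM
from transformers import AutoTokenizer

model_id = "duanxianpi/IntelliTeX"

# Export the seq2seq checkpoint to ONNX (encoder + decoder graphs).
ort_model = ORTModelForSeq2SeqLM.from_pretrained(model_id, export=True)
tok = AutoTokenizer.from_pretrained(model_id)

# Save the converted model; the resulting folder can be served to Transformers.js.
ort_model.save_pretrained("intellitex-onnx")
tok.save_pretrained("intellitex-onnx")
```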
Full Evaluation Results
1. Comprehensive performance on the S2L test dataset (2745 samples)
| Model Architecture | Method | EM ↑ | CR ↑ | CER ↓ | TexBLEU ↑ |
|---|---|---|---|---|---|
| SmolLM2 (135M) | Base | 0.005 | 0.790 | 42.40 | 0.743 |
| | Base + Grammar | 0.011 | 0.822 | 7.90 | 0.279 |
| | LoRA | 0.126 | 0.957 | 0.90 | 0.823 |
| | LoRA + Grammar | 0.127 | 0.957 | 0.91 | 0.824 |
| SmolLM2 (360M) | Base | 0.107 | 0.695 | 9.38 | 0.802 |
| | Base + Grammar | 0.142 | 0.760 | 10.00 | 0.812 |
| | LoRA | 0.242 | 0.980 | 0.49 | 0.861 |
| | LoRA + Grammar | 0.243 | 0.980 | 0.49 | 0.862 |
| CodeT5+ (220M) | Base | 0.000 | 0.921 | 96.01 | 0.725 |
| | LoRA | 0.258 | 0.913 | 0.39 | 0.874 |
| | FPFT | 0.467 | 0.982 | 0.22 | 0.912 |
| | Stage 1 + FPFT (IntelliTeX) | 0.463 | 0.998 | 0.22 | 0.915 |
| Qwen2.5-Coder (0.5B) | Base | 0.161 | 0.974 | 1.27 | 0.830 |
| | Base + Grammar | 0.160 | 0.978 | 1.27 | 0.831 |
| | LoRA | 0.155 | 0.909 | 2.71 | 0.836 |
| | LoRA + Grammar | 0.155 | 0.967 | 1.75 | 0.838 |
| | FPFT | 0.405 | 0.990 | 0.24 | 0.902 |
| Qwen2.5-Coder (3B) | Base | 0.294 | 0.991 | 0.46 | 0.869 |
| | Base + Grammar | 0.293 | 0.996 | 0.45 | 0.870 |
| | FPFT | 0.507 | 0.997 | 0.18 | 0.919 |
| Qwen2.5-Coder (32B) | Base | 0.121 | 1.000 | 0.38 | 0.863 |
Note: EM = Exact Match, CR = Compilable Rate, CER = Character Error Rate. Base = Original Instruct Model, Grammar = Structured Decoding, Stage 1 = Domain-Adaptive Pre-training.
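For reference, the two simplest metrics in the table can be reproduced with a few lines of Python. The sketch below computes Exact Match and Character Error Rate (edit distance divided by reference length); the exact string normalization used in the report is not specified here, so the whitespace handling is an assumption.

```python
def exact_match(pred: str, ref: str) -> float:
    """1.0 if the normalized strings are identical, else 0.0."""
    norm = lambda s: " ".join(s.split())  # assumed normalization: collapse whitespace
    return float(norm(pred) == norm(ref))

def cer(pred: str, ref: str) -> float:
    """Character Error Rate: Levenshtein distance / reference length."""
    prev = list(range(len(ref) + 1))
    for i, p in enumerate(pred, 1):
        curr = [i]
        for j, r in enumerate(ref, 1):
            curr.append(min(prev[j] + 1,               # deletion
                            curr[j - 1] + 1,           # insertion
                            prev[j - 1] + (p != r)))   # substitution
        prev = curr
    return prev[-1] / max(len(ref), 1)

print(exact_match(r"$$x^{2}$$", r"$$x^{2}$$"))  # 1.0
print(round(cer(r"x^{2}", r"x^{3}"), 3))        # 0.2 (1 edit / 5 chars)
```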
2. Stress Test Analysis
Performance on Long Context Inputs (Source > 115 chars)
Demonstrates the model's ability to understand lengthy natural language descriptions.
| Model (FPFT) | EM ↑ | CR ↑ | CER ↓ | TexBLEU ↑ |
|---|---|---|---|---|
| CodeT5+ (220M) | 0.150 | 0.967 | 0.219 | 0.868 |
| IntelliTeX (Stage 1 + FPFT) | 0.195 | 0.997 | 0.211 | 0.873 |
| Qwen2.5-Coder (0.5B) | 0.129 | 0.976 | 0.292 | 0.859 |
| Qwen2.5-Coder (3B) | 0.209 | 0.996 | 0.199 | 0.874 |
Performance on Long Sequence Generation (Target > 60 chars)
Demonstrates the model's ability to generate complex, long LaTeX formulas.
| Model (FPFT) | EM ↑ | CR ↑ | CER ↓ | TexBLEU ↑ |
|---|---|---|---|---|
| CodeT5+ (220M) | 0.049 | 0.940 | 0.297 | 0.827 |
| IntelliTeX (Stage 1 + FPFT) | 0.076 | 0.991 | 0.312 | 0.828 |
| Qwen2.5-Coder (0.5B) | 0.037 | 0.967 | 0.394 | 0.816 |
| Qwen2.5-Coder (3B) | 0.070 | 0.988 | 0.350 | 0.822 |
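The two stress subsets above are defined purely by length thresholds on the MathBridge pairs. The filter below is a hypothetical sketch assuming examples are dicts with `source` (natural-language description) and `target` (LaTeX) text fields; the actual field names in the MathBridge release may differ.

```python
# Hypothetical example records; real MathBridge fields may be named differently.
examples = [
    {"source": "the integral from zero to one of x squared dx",
     "target": r"\int_{0}^{1}x^{2}\,dx"},
    # ...
]

long_context = [ex for ex in examples if len(ex["source"]) > 115]  # Source > 115 chars
long_target  = [ex for ex in examples if len(ex["target"]) > 60]   # Target > 60 chars

print(len(long_context), len(long_target))
```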