IntelliTeX: Natural Language → LaTeX (Experimental)
Model summary
IntelliTeX is an experimental Small Language Model (SLM) study for converting spoken-style English math descriptions into a single LaTeX equation. It is intended as a research artifact (training regimes, decoding constraints, stress tests), not a production-ready LaTeX authoring system.
- Base model: Salesforce/codet5p-220m (CodeT5+ 220M)
- Primary task: text → LaTeX equation generation (single equation output)
- Primary language: English
What the model is for
Intended use
- Drafting LaTeX equations from short natural-language descriptions
- Prototyping or benchmarking compact models on domain-specific translation
Not recommended
- Fully automated formula generation without human or programmatic verification (a minimal pre-check sketch follows this list)
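Because generated LaTeX should always be checked before use, a lightweight pre-check can reject obviously malformed outputs before they reach a reviewer. The sketch below is a minimal heuristic validator, not part of the released tooling: it only verifies delimiter balance, not compilability or mathematical correctness.

```python
import re

def looks_well_formed(latex: str) -> bool:
    """Heuristic sanity check for a generated equation.

    Catches unbalanced braces and mismatched \\left/\\right pairs;
    it does NOT check that the formula compiles or is mathematically correct.
    """
    body = latex.strip().strip("$")  # tolerate $$...$$ wrappers
    if body.count("{") != body.count("}"):
        return False
    if len(re.findall(r"\\left\b", body)) != len(re.findall(r"\\right\b", body)):
        return False
    return bool(body)  # reject empty output

print(looks_well_formed(r"$$\int_{0}^{1}x^{2}\,dx$$"))  # True
print(looks_well_formed(r"$$\frac{a}{b$$"))             # False (unbalanced brace)
```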
Training approach (experimental study)
We evaluated multiple training configurations to understand what improves a compact model most:
- LoRA fine-tuning: rapid iteration and capability checks
- Full-parameter fine-tuning (FPFT): to measure the performance ceiling (LoRA often underperformed FPFT)
- Two-stage pipeline (continued pretraining → FPFT) inspired by CodeT5+ training recipes (a toy span-corruption sketch follows this list):
- Stage 1: domain-adaptive continued pretraining on TeXTeller with span-denoising + causal LM objectives
- ~4B tokens, ~76k steps
- Stage 2: supervised FPFT on Speech2LaTeX textβLaTeX pairs
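For readers unfamiliar with the span-denoising objective used in Stage 1, the toy sketch below shows the general T5-style recipe: random spans of the LaTeX token stream are replaced by sentinel tokens in the encoder input, and the decoder learns to reproduce the masked spans. The corruption rate, span-length distribution, and sentinel handling here are illustrative defaults, not the exact Stage 1 hyperparameters.

```python
import random

def span_corrupt(tokens, corrupt_rate=0.15, mean_span_len=3, seed=0):
    """Toy T5-style span corruption over a token list.

    Returns (source, target): masked spans become <extra_id_i> sentinels
    in the source, and the target lists each sentinel followed by the
    tokens it hides.
    """
    rng = random.Random(seed)
    n_to_mask = max(1, int(len(tokens) * corrupt_rate))
    masked = set()
    while len(masked) < n_to_mask:
        start = rng.randrange(len(tokens))
        length = max(1, round(rng.expovariate(1 / mean_span_len)))
        masked.update(range(start, min(start + length, len(tokens))))

    source, target, sentinel = [], [], 0
    i = 0
    while i < len(tokens):
        if i in masked:
            source.append(f"<extra_id_{sentinel}>")
            target.append(f"<extra_id_{sentinel}>")
            while i < len(tokens) and i in masked:
                target.append(tokens[i])
                i += 1
            sentinel += 1
        else:
            source.append(tokens[i])
            i += 1
    return " ".join(source), " ".join(target)

src, tgt = span_corrupt(r"\int _ { 0 } ^ { 1 } x ^ { 2 } \, d x".split())
print(src)  # LaTeX tokens with random spans replaced by <extra_id_i> sentinels
print(tgt)  # the sentinels followed by the tokens they hide
```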
Experiment Results
- Full-parameter fine-tuning (FPFT) was the largest single driver of gains in our experiments. In our report, FPFT CodeT5+ 220M reached EM 0.467 (0.463 with Stage 1 pretraining), roughly 4× the zero-shot Qwen2.5-Coder 32B Instruct baseline (EM 0.121) under the same evaluation setup.
- On the main Speech2LaTeX (S2L) benchmark, FPFT CodeT5+ 220M outperformed the larger FPFT Qwen2.5-Coder 0.5B baseline (EM 0.467 vs 0.405), indicating that training regime and architecture can matter more than parameter count for this task.
- Stage 1 (domain-adaptive continued pretraining) primarily improved robustness rather than average-case performance: it did not materially change EM on the main S2L test set (e.g., 0.467 vs 0.463), but helped more on stress conditions.
- On MathBridge stress tests, CodeT5+ 220M with Stage 1 + FPFT closely matched a much larger 3B comparator on long-context and long-target subsets, and outperformed the model with only FPFT.
Main benchmark (Speech2LaTeX test set)
- Qwen2.5-Coder 32B (base, no fine-tuning): EM 0.121
- FPFT Qwen2.5-Coder 0.5B: EM 0.405
- Stage 1 + FPFT CodeT5+ 220M: EM 0.463
- FPFT CodeT5+ 220M: EM 0.467
- FPFT Qwen2.5-Coder 3B: EM 0.507
Stress tests (MathBridge subsets)
- Long-context inputs (source length > 115 chars):
- FPFT CodeT5+ 220M: EM 0.150
- Stage 1 + FPFT CodeT5+ 220M: EM 0.195
- FPFT Qwen2.5-Coder 3B: EM 0.209
- Long-target outputs (target length > 60 chars):
- FPFT CodeT5+ 220M: EM 0.049
- FPFT Qwen2.5-Coder 3B: EM 0.070
- Stage 1 + FPFT CodeT5+ 220M: EM 0.076
Usage
```python
from transformers import AutoTokenizer, AutoModelForSeq2SeqLM

model_id = "duanxianpi/IntelliTeX"
tok = AutoTokenizer.from_pretrained(model_id)
model = AutoModelForSeq2SeqLM.from_pretrained(model_id)

# Spoken-style description of the target equation
text = "the integral from zero to one of x squared dx"
prompt = f"Convert natural-language math into a STRICT LaTeX equation\n{text}"

inputs = tok(prompt, return_tensors="pt")
out = model.generate(
    **inputs,
    max_length=512,
)
print(tok.decode(out[0], skip_special_tokens=True))
# Output: $$\int_{0}^{1}x^{2}\,dx$$
```
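The snippet above uses default greedy decoding. Beam search sometimes recovers better-formed equations at a modest latency cost; the settings below (num_beams=4, early_stopping) are illustrative and are not necessarily the configuration behind the reported benchmark numbers. It continues from the variables defined in the snippet above.

```python
out = model.generate(
    **inputs,
    max_length=512,
    num_beams=4,          # small beam; illustrative, not the evaluated setting
    early_stopping=True,  # stop once all beams have finished
)
print(tok.decode(out[0], skip_special_tokens=True))
```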
Running on Transformers.js
A live, in-browser demo built with the Transformers.js library showcases this efficiency advantage on typical CPU hardware.
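Transformers.js runs ONNX weights in the browser, so the checkpoint has to be converted before it can be loaded client-side. The sketch below is a minimal export using Hugging Face Optimum's standard optimum-onnxruntime path; the hosted demo may use its own conversion and quantization settings.

```python
from optimum.onnxruntime import ORTModelForSeq2SeqLM
from transformers import AutoTokenizer

model_id = "duanxianpi/IntelliTeX"

# Export the seq2seq checkpoint to ONNX (encoder + decoder graphs).
ort_model = ORTModelForSeq2SeqLM.from_pretrained(model_id, export=True)
tok = AutoTokenizer.from_pretrained(model_id)

# Save the converted model; the resulting folder can be served to Transformers.js.
ort_model.save_pretrained("intellitex-onnx")
tok.save_pretrained("intellitex-onnx")
```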
Full Evaluation Results
1. Comprehensive performance on the S2L test dataset (2745 samples)
| Model Architecture | Method | EM ↑ | CR ↑ | CER ↓ | TexBLEU ↑ |
|---|---|---|---|---|---|
| SmolLM2 (135M) | Base | 0.005 | 0.790 | 42.40 | 0.743 |
| | Base + Grammar | 0.011 | 0.822 | 7.90 | 0.279 |
| | LoRA | 0.126 | 0.957 | 0.90 | 0.823 |
| | LoRA + Grammar | 0.127 | 0.957 | 0.91 | 0.824 |
| SmolLM2 (360M) | Base | 0.107 | 0.695 | 9.38 | 0.802 |
| | Base + Grammar | 0.142 | 0.760 | 10.00 | 0.812 |
| | LoRA | 0.242 | 0.980 | 0.49 | 0.861 |
| | LoRA + Grammar | 0.243 | 0.980 | 0.49 | 0.862 |
| CodeT5+ (220M) | Base | 0.000 | 0.921 | 96.01 | 0.725 |
| | LoRA | 0.258 | 0.913 | 0.39 | 0.874 |
| | FPFT | 0.467 | 0.982 | 0.22 | 0.912 |
| | Stage 1 + FPFT (IntelliTeX) | 0.463 | 0.998 | 0.22 | 0.915 |
| Qwen2.5-Coder (0.5B) | Base | 0.161 | 0.974 | 1.27 | 0.830 |
| | Base + Grammar | 0.160 | 0.978 | 1.27 | 0.831 |
| | LoRA | 0.155 | 0.909 | 2.71 | 0.836 |
| | LoRA + Grammar | 0.155 | 0.967 | 1.75 | 0.838 |
| | FPFT | 0.405 | 0.990 | 0.24 | 0.902 |
| Qwen2.5-Coder (3B) | Base | 0.294 | 0.991 | 0.46 | 0.869 |
| | Base + Grammar | 0.293 | 0.996 | 0.45 | 0.870 |
| | FPFT | 0.507 | 0.997 | 0.18 | 0.919 |
| Qwen2.5-Coder (32B) | Base | 0.121 | 1.000 | 0.38 | 0.863 |
Note: EM = Exact Match, CR = Compilable Rate, CER = Character Error Rate. Base = Original Instruct Model, Grammar = Structured Decoding, Stage 1 = Domain-Adaptive Pre-training.
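For reference, the two simplest metrics in the table can be reproduced with a few lines of Python. The sketch below computes Exact Match and Character Error Rate (edit distance divided by reference length); the exact string normalization used in the report is not specified here, so the whitespace handling is an assumption.

```python
def exact_match(pred: str, ref: str) -> float:
    """1.0 if the normalized strings are identical, else 0.0."""
    norm = lambda s: " ".join(s.split())  # assumed normalization: collapse whitespace
    return float(norm(pred) == norm(ref))

def cer(pred: str, ref: str) -> float:
    """Character Error Rate: Levenshtein distance / reference length."""
    prev = list(range(len(ref) + 1))
    for i, p in enumerate(pred, 1):
        curr = [i]
        for j, r in enumerate(ref, 1):
            curr.append(min(prev[j] + 1,               # deletion
                            curr[j - 1] + 1,           # insertion
                            prev[j - 1] + (p != r)))   # substitution
        prev = curr
    return prev[-1] / max(len(ref), 1)

print(exact_match(r"$$x^{2}$$", r"$$x^{2}$$"))  # 1.0
print(round(cer(r"x^{2}", r"x^{3}"), 3))        # 0.2 (1 edit / 5 chars)
```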
2. Stress Test Analysis
Performance on Long Context Inputs (Source > 115 chars)
Demonstrates the model's ability to understand lengthy natural language descriptions.
| Model (FPFT) | EM ↑ | CR ↑ | CER ↓ | TexBLEU ↑ |
|---|---|---|---|---|
| CodeT5+ (220M) | 0.150 | 0.967 | 0.219 | 0.868 |
| IntelliTeX (Stage 1 + FPFT) | 0.195 | 0.997 | 0.211 | 0.873 |
| Qwen2.5-Coder (0.5B) | 0.129 | 0.976 | 0.292 | 0.859 |
| Qwen2.5-Coder (3B) | 0.209 | 0.996 | 0.199 | 0.874 |
Performance on Long Sequence Generation (Target > 60 chars)
Demonstrates the model's ability to generate complex, long LaTeX formulas.
| Model (FPFT) | EM ↑ | CR ↑ | CER ↓ | TexBLEU ↑ |
|---|---|---|---|---|
| CodeT5+ (220M) | 0.049 | 0.940 | 0.297 | 0.827 |
| IntelliTeX (Stage 1 + FPFT) | 0.076 | 0.991 | 0.312 | 0.828 |
| Qwen2.5-Coder (0.5B) | 0.037 | 0.967 | 0.394 | 0.816 |
| Qwen2.5-Coder (3B) | 0.070 | 0.988 | 0.350 | 0.822 |
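The two stress subsets above are defined purely by length thresholds on the MathBridge pairs. The filter below is a hypothetical sketch assuming examples are dicts with `source` (natural-language description) and `target` (LaTeX) text fields; the actual field names in the MathBridge release may differ.

```python
# Hypothetical example records; real MathBridge fields may be named differently.
examples = [
    {"source": "the integral from zero to one of x squared dx",
     "target": r"\int_{0}^{1}x^{2}\,dx"},
    # ...
]

long_context = [ex for ex in examples if len(ex["source"]) > 115]  # Source > 115 chars
long_target  = [ex for ex in examples if len(ex["target"]) > 60]   # Target > 60 chars

print(len(long_context), len(long_target))
```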