IntelliTeX: Natural Language → LaTeX (Experimental)

Model summary

IntelliTeX is an experimental Small Language Model (SLM) study for converting spoken-style English math descriptions into a single LaTeX equation. It is intended as a research artifact (training regimes, decoding constraints, stress tests), not a production-ready LaTeX authoring system.

  • Base model: Salesforce/codet5p-220m (CodeT5+ 220M)
  • Primary task: text → LaTeX equation generation (single equation output)
  • Primary language: English

What the model is for

Intended use

  • Drafting LaTeX equations from short natural-language descriptions
  • Prototyping or benchmarking compact models on domain-specific translation

Not recommended

  • Fully automated formula generation without verification
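Outputs should always be checked before use. One lightweight verification step is a compile check; the sketch below is an illustration rather than part of this repository, and it assumes a local pdflatex install.

import pathlib
import subprocess
import tempfile

def latex_compiles(equation: str, timeout: int = 20) -> bool:
    # Wrap a single display equation in a minimal document and try to compile it.
    doc = (
        "\\documentclass{article}\n"
        "\\usepackage{amsmath}\n"
        "\\begin{document}\n"
        f"{equation}\n"
        "\\end{document}\n"
    )
    with tempfile.TemporaryDirectory() as tmp:
        tex = pathlib.Path(tmp) / "eq.tex"
        tex.write_text(doc)
        result = subprocess.run(
            ["pdflatex", "-interaction=nonstopmode", "-halt-on-error", tex.name],
            cwd=tmp,
            capture_output=True,
            timeout=timeout,
        )
    return result.returncode == 0

print(latex_compiles(r"$$\int_{0}^{1}x^{2}\,dx$$"))  # True if the equation compiles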

Training approach (experimental study)

We evaluated multiple training configurations to understand what improves a compact model most:

  1. LoRA fine-tuning: rapid iteration and capability checks (a configuration sketch follows this list)
  2. Full-parameter fine-tuning (FPFT): to measure the performance ceiling (LoRA often underperformed FPFT)
  3. Two-stage pipeline (continued pretraining → FPFT) inspired by CodeT5+ training recipes:
    • Stage 1: domain-adaptive continued pretraining on TeXTeller with span-denoising + causal LM objectives
      • ~4B tokens, ~76k steps
    • Stage 2: supervised FPFT on Speech2LaTeX text→LaTeX pairs
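As a concrete illustration of configuration (1), a minimal LoRA setup might look like the sketch below. It assumes the Hugging Face transformers and peft libraries; the rank, alpha, and target modules shown are placeholder values, not the ones used in the study.

from transformers import AutoTokenizer, AutoModelForSeq2SeqLM
from peft import LoraConfig, TaskType, get_peft_model

base_id = "Salesforce/codet5p-220m"
tok = AutoTokenizer.from_pretrained(base_id)
model = AutoModelForSeq2SeqLM.from_pretrained(base_id)

# Attach low-rank adapters; only the adapter weights are updated during training.
lora_cfg = LoraConfig(
    task_type=TaskType.SEQ_2_SEQ_LM,
    r=16,                       # illustrative rank
    lora_alpha=32,
    lora_dropout=0.05,
    target_modules=["q", "v"],  # attention projections in the T5-style blocks
)
model = get_peft_model(model, lora_cfg)
model.print_trainable_parameters()  # reports trainable vs. total parameters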

Experiment Results

  • Full-parameter fine-tuning (FPFT) was the largest single driver of gains in our experiments. In our report, FPFT CodeT5+ 220M reached EM 0.467, roughly 4× the EM of Qwen2.5-Coder 32B Instruct (0.121) under the same evaluation setup.
  • On the main Speech2LaTeX (S2L) benchmark, FPFT CodeT5+ 220M outperformed a larger 0.5B FPFT baseline in our report, indicating that training regime and architecture can matter more than parameter count for this task.
  • Stage 1 (domain-adaptive continued pretraining) primarily improved robustness rather than average-case performance: it did not materially change EM on the main S2L test set (e.g., 0.467 vs 0.463), but helped under stress conditions.
  • On MathBridge stress tests, CodeT5+ 220M with Stage 1 + FPFT closely matched a much larger 3B comparator on long-context and long-target subsets, and outperformed the model with only FPFT.

Main benchmark (Speech2LaTeX test set)

  • Qwen2.5-Coder 32B: EM 0.121
  • FPFT Qwen2.5-Coder 0.5B: EM 0.405
  • Stage 1 + FPFT CodeT5+ 220M: EM 0.463
  • FPFT CodeT5+ 220M: EM 0.467
  • FPFT Qwen2.5-Coder 3B: EM 0.507

Stress tests (MathBridge subsets)

  • Long-context inputs (source length > 115 chars):
    • FPFT CodeT5+ 220M: EM 0.150
    • Stage 1 + FPFT CodeT5+ 220M: EM 0.195
    • FPFT Qwen2.5-Coder 3B: EM 0.209
  • Long-target outputs (target length > 60 chars):
    • FPFT CodeT5+ 220M: EM 0.049
    • FPFT Qwen2.5-Coder 3B: EM 0.070
    • Stage 1 + FPFT CodeT5+ 220M: EM 0.076

Usage

from transformers import AutoTokenizer, AutoModelForSeq2SeqLM

model_id = "duanxianpi/IntelliTeX"
tok = AutoTokenizer.from_pretrained(model_id)
model = AutoModelForSeq2SeqLM.from_pretrained(model_id)

# Spoken-style description of the target equation.
text = "the integral from zero to one of x squared dx"
# Instruction line followed by the description.
prompt = f"Convert natural-language math into a STRICT LaTeX equation\n{text}"

inputs = tok(prompt, return_tensors="pt")

out = model.generate(
    **inputs,
    max_length=512,
)

print(tok.decode(out[0], skip_special_tokens=True))
# Output: $$\int_{0}^{1}x^{2}\,dx$$

Running on Transformers.js

A live, in-browser demo built with the Transformers.js library showcases the 220M model's efficiency advantage on typical CPU hardware.

The demo compares the following models:

  • IntelliTeX (ours)
  • Qwen2.5-Coder-0.5B-Instruct

Full Evaluation Results

1. Comprehensive performance on the S2L test dataset (2745 samples)

Model Architecture      Method                        EM ↑    CR ↑    CER ↓   TexBLEU ↑
SmolLM2 (135M)          Base                          0.005   0.790   42.40   0.743
                        Base + Grammar                0.011   0.822    7.90   0.279
                        LoRA                          0.126   0.957    0.90   0.823
                        LoRA + Grammar                0.127   0.957    0.91   0.824
SmolLM2 (360M)          Base                          0.107   0.695    9.38   0.802
                        Base + Grammar                0.142   0.760   10.00   0.812
                        LoRA                          0.242   0.980    0.49   0.861
                        LoRA + Grammar                0.243   0.980    0.49   0.862
CodeT5+ (220M)          Base                          0.000   0.921   96.01   0.725
                        LoRA                          0.258   0.913    0.39   0.874
                        FPFT                          0.467   0.982    0.22   0.912
                        Stage 1 + FPFT (IntelliTeX)   0.463   0.998    0.22   0.915
Qwen2.5-Coder (0.5B)    Base                          0.161   0.974    1.27   0.830
                        Base + Grammar                0.160   0.978    1.27   0.831
                        LoRA                          0.155   0.909    2.71   0.836
                        LoRA + Grammar                0.155   0.967    1.75   0.838
                        FPFT                          0.405   0.990    0.24   0.902
Qwen2.5-Coder (3B)      Base                          0.294   0.991    0.46   0.869
                        Base + Grammar                0.293   0.996    0.45   0.870
                        FPFT                          0.507   0.997    0.18   0.919
Qwen2.5-Coder (32B)     Base                          0.121   1.000    0.38   0.863

Note: EM = Exact Match, CR = Compilable Rate, CER = Character Error Rate. Base = Original Instruct Model, Grammar = Structured Decoding, Stage 1 = Domain-Adaptive Pre-training.
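For reference, the sketch below shows one straightforward way to compute EM and CER for a single prediction/reference pair. It assumes whitespace-normalized string comparison and a character-level Levenshtein distance; the report's evaluation code may differ in normalization details.

def normalize(s: str) -> str:
    # Collapse whitespace so formatting differences do not affect the comparison.
    return " ".join(s.split())

def exact_match(pred: str, ref: str) -> float:
    return float(normalize(pred) == normalize(ref))

def cer(pred: str, ref: str) -> float:
    # Character Error Rate: Levenshtein edit distance divided by reference length.
    pred, ref = normalize(pred), normalize(ref)
    prev = list(range(len(ref) + 1))
    for i, p in enumerate(pred, start=1):
        curr = [i]
        for j, r in enumerate(ref, start=1):
            curr.append(min(
                prev[j] + 1,             # deletion
                curr[j - 1] + 1,         # insertion
                prev[j - 1] + (p != r),  # substitution (0 cost if characters match)
            ))
        prev = curr
    return prev[-1] / max(len(ref), 1)

print(exact_match(r"$$x^{2}$$", r"$$x^{2}$$"))  # 1.0
print(cer(r"$$x^{3}$$", r"$$x^{2}$$"))          # 1 edit / 9 chars ≈ 0.111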

2. Stress Test Analysis

Performance on Long Context Inputs (Source > 115 chars)

Demonstrates the model's ability to understand lengthy natural language descriptions.

Model (FPFT)                  EM ↑    CR ↑    CER ↓   TexBLEU ↑
CodeT5+ (220M)                0.150   0.967   0.219   0.868
IntelliTeX (Stage 1 + FPFT)   0.195   0.997   0.211   0.873
Qwen2.5-Coder (0.5B)          0.129   0.976   0.292   0.859
Qwen2.5-Coder (3B)            0.209   0.996   0.199   0.874

Performance on Long Sequence Generation (Target > 60 chars)

Demonstrates the model's ability to generate complex, long LaTeX formulas.

Model (FPFT)                  EM ↑    CR ↑    CER ↓   TexBLEU ↑
CodeT5+ (220M)                0.049   0.940   0.297   0.827
IntelliTeX (Stage 1 + FPFT)   0.076   0.991   0.312   0.828
Qwen2.5-Coder (0.5B)          0.037   0.967   0.394   0.816
Qwen2.5-Coder (3B)            0.070   0.988   0.350   0.822