neo-3-1B-A90M-Instruct
This is the instruction-tuned version of neo-3-1B-A90M-Base. For larger context and stronger chain-of-thought, see neo-3-3B-A400M-Base and the neo-3-3B-A400M-Thinking model.
The neo-3-1B-A90M-Instruct model is a decoder-only sparse MoE model focused on chat-style instruction following, practical reasoning, and light code/math usage on commodity GPUs. It is trained on top of the neo-3-1B-A90M base checkpoint with supervised instruction data and light preference-style alignment, while preserving the efficiency profile of ~90M active parameters per token.
Core properties:
- 1B total parameters, ~90M active parameters (top-2-of-8 experts per token).
- 8K context window suitable for multi-step reasoning, small tool pipelines, and code editing sessions.
- Mixtral-style MoE FFNs with grouped-query attention and RoPE (a minimal routing sketch follows this list).
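The "1B total, ~90M active" split follows directly from this routing: every layer stores all eight expert FFNs, but each token is dispatched to only two of them. A minimal PyTorch sketch of top-2-of-8 routing (sizes and layer shapes below are illustrative, not the model's actual configuration):

```python
import torch
import torch.nn.functional as F

# Illustrative sizes only; not the released checkpoint's real configuration.
hidden_size, num_experts, top_k = 512, 8, 2

# One small FFN per expert; a Mixtral-style layer keeps all of them in memory.
experts = [
    torch.nn.Sequential(
        torch.nn.Linear(hidden_size, 4 * hidden_size),
        torch.nn.SiLU(),
        torch.nn.Linear(4 * hidden_size, hidden_size),
    )
    for _ in range(num_experts)
]
router = torch.nn.Linear(hidden_size, num_experts)

def moe_ffn(x):
    # x: (num_tokens, hidden_size). The router scores all 8 experts per token...
    weights = F.softmax(router(x), dim=-1)
    top_w, top_idx = weights.topk(top_k, dim=-1)      # ...but only the top-2 are kept
    top_w = top_w / top_w.sum(dim=-1, keepdim=True)   # and renormalised to sum to 1.
    out = torch.zeros_like(x)
    for slot in range(top_k):
        for e in range(num_experts):
            mask = top_idx[:, slot] == e
            if mask.any():
                # Only the two selected experts run for a given token, so the
                # parameters touched per token stay far below the total count.
                out[mask] += top_w[mask, slot, None] * experts[e](x[mask])
    return out

print(moe_ffn(torch.randn(4, hidden_size)).shape)  # torch.Size([4, 512])
```

All eight experts count toward the 1B total parameters, but each token only exercises two of them plus the shared attention and embedding weights, which is where the ~90M active figure comes from.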
The model is released under the MIT license and is intended as a compact, open, and easily finetunable instruction model suitable for Colab, consumer GPUs, and research setups.
Intended use
- General assistant: Q&A, explanations, drafting, brainstorming, and everyday chat.
- Light reasoning: step-by-step math, small puzzles, pros/cons analysis, and short chain-of-thought traces when prompted.
- Code and tooling: code snippets, simple refactors, short scripts, and function-level suggestions.
- Research and teaching: MoE experiments, scaling-law studies, and instruction-tuning ablations.
The model is not designed for:
- High-stakes uses (medical, legal, financial, safety-critical decisions).
- Long-form multi-document retrieval-augmented generation without an external RAG system.
- Fully reliable formal math or large codebase refactors.
Evaluations
Below are performance figures for neo-3-1B-A90M-Instruct compared with instruction-tuned models from the Gemma, Qwen, and SmolLM2 families.
Instruction-following performance
| Model | MMLU | HellaSwag | PIQA | ARC avg | GSM8K | BBH | IFEval |
|---|---|---|---|---|---|---|---|
| neo-3-1B-A90M-Instruct | 34.2 | 56.6 | 66.1 | 41.9 | 2.2 | 29.4 | 45.5 |
| Gemma 3 IT 270M | 31.2 | 37.7 | 66.2 | 32.1 | 11.4 | 26.7 | 51.2 |
| SmolLM2-360M-Instruct | 32.8 | 52.1 | 70.8 | 43.7 | 7.4 | 27.3 | 41.0 |
| Qwen2.5-0.5B-Instruct | 33.7 | 48.0 | 67.2 | 37.3 | 26.8 | 30.7 | 31.6 |
Tool-calling performance
TinyTask is a benchmark that evaluates a model's ability to generate structured outputs, which serves as a proxy for tool-calling performance. Our TinyTask subset contains 300 rows: 150 travel problems and 150 math problems. We verified that TinyTask outputs do not appear in any of our models' training data. A rough sketch of this style of check follows the table below.
| Model | TinyTask Accuracy |
|---|---|
| neo-3-1B-A90M-Instruct | 30.0 |
| Gemma 3 IT 270M | 0.0 |
| SmolLM2-360M-Instruct | 7.5 |
| Qwen2.5-0.5B-Instruct | 5.0 |
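TinyTask itself is not reproduced here, but the kind of check it implies is straightforward: ask the model for a machine-readable answer and verify that the output parses and matches the expected fields. A rough, hypothetical sketch of that style of scoring (the prompt wording, JSON schema, and example row below are assumptions, not the benchmark's actual format):

```python
import json

def score_structured_output(model_answer: str, expected: dict) -> bool:
    """Return True if the model emitted valid JSON containing the expected fields."""
    try:
        parsed = json.loads(model_answer)
    except json.JSONDecodeError:
        return False  # unparseable output counts as a miss
    return all(parsed.get(key) == value for key, value in expected.items())

# Hypothetical travel-style row; the model is told to answer with JSON only.
prompt = (
    "Pick the cheapest flight and reply with JSON only, "
    'e.g. {"airline": "...", "price": 0}.\n'
    "A: SkyJet $120\nB: AeroLow $95\nC: CloudAir $140"
)
expected = {"airline": "AeroLow", "price": 95}
# Accuracy is the fraction of rows where score_structured_output(completion, expected)
# is True, with completions produced as in the inference example below.
```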
Behavior in practice
- Handles most day-to-day instructions with coherent, on-topic answers and is competitive with the other sub-1B instruction models evaluated above (Gemma 3 IT 270M, SmolLM2-360M-Instruct, Qwen2.5-0.5B-Instruct) on typical chat workloads.
- Can work through GSM8K-style grade-school math problems with explicit reasoning when asked, though accuracy remains well below larger 3B–7B models.
- Produces concise explanations by default and can expand into more detailed chain-of-thought when explicitly prompted in research settings.
Usage
Inference
from transformers import AutoModelForCausalLM, AutoTokenizer
model_id = "aquiffoo/neo-3-1B-A90M-Instruct"
tokenizer = AutoTokenizer.from_pretrained(model_id)
model = AutoModelForCausalLM.from_pretrained(
model_id,
torch_dtype="auto",
device_map="auto"
)
prompt = "Explain why MoE models can have many total parameters but few active parameters per token."
inputs = tokenizer(prompt, return_tensors="pt").to(model.device)
output = model.generate(
    **inputs,
    max_new_tokens=256,
    do_sample=True,  # required for temperature/top_p to take effect
    temperature=0.7,
    top_p=0.9
)
print(tokenizer.decode(output[0], skip_special_tokens=True))
Chat formatting
The instruct model expects simple chat-style prompts with explicit user instructions. A minimal convention that works well:
<user>
You are a helpful assistant. Explain sparse mixture-of-experts models to a beginner.
</user>
<assistant>
Multi-turn chat can be created by concatenating user/assistant turns in the same style and re-feeding the whole context.
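For example (closing completed assistant turns with </assistant> is an assumption extrapolated from the user-tag pattern above):

```python
def format_chat(turns):
    # turns: list of (role, text) pairs, where role is "user" or "assistant".
    parts = [f"<{role}>\n{text}\n</{role}>" for role, text in turns]
    # End with an open assistant tag so the model continues as the assistant.
    return "\n".join(parts) + "\n<assistant>\n"

history = [
    ("user", "Explain sparse mixture-of-experts models to a beginner."),
    ("assistant", "They are networks where a router picks a few expert sub-networks per token."),
    ("user", "How does that save compute?"),
]
prompt = format_chat(history)
# Tokenize `prompt` and call model.generate() exactly as in the inference example above.
```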
Training and data overview
- Base model: neo-3-1B-A90M-Base trained on a mixture of Wikipedia, web-scale synthetic corpora (e.g., Cosmopedia-like), code (The Stack, GitHub), math, and dialogue sources.
- Post-training:
  - Supervised fine-tuning on instruction datasets (general chat, reasoning, math/code, tool-use style prompts); a minimal user-side LoRA sketch follows this overview.
  - Light preference-style alignment using curated pairs that reward helpful, honest, non-toxic behavior.
- Tokenization: SentencePiece/BPE with a 32k vocabulary; positions are encoded with RoPE.
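Since the card positions the model as easily finetunable on consumer hardware, here is a minimal LoRA-style sketch for continuing supervised fine-tuning on your own instruction data; the adapter hyperparameters and target module names are illustrative assumptions, not the settings used to produce this checkpoint:

```python
from transformers import AutoModelForCausalLM, AutoTokenizer
from peft import LoraConfig, get_peft_model

model_id = "aquiffoo/neo-3-1B-A90M-Instruct"
tokenizer = AutoTokenizer.from_pretrained(model_id)
model = AutoModelForCausalLM.from_pretrained(model_id, torch_dtype="auto", device_map="auto")

# Assumed Mixtral-style attention projection names; adjust to the checkpoint's actual modules.
lora_config = LoraConfig(
    r=8,
    lora_alpha=16,
    lora_dropout=0.05,
    target_modules=["q_proj", "k_proj", "v_proj", "o_proj"],
    task_type="CAUSAL_LM",
)
model = get_peft_model(model, lora_config)
model.print_trainable_parameters()  # only the small adapter matrices are trainable

# From here, train with a standard causal-LM loop or Trainer on instruction-formatted
# text (for example, the <user>/<assistant> convention shown in Chat formatting).
```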
Limitations and risks
- May hallucinate facts, especially for niche or very recent topics.
- Reasoning chains can be shallow or brittle on harder benchmarks (MATH, GSM8K, BBH).
- Output may reflect biases from pretraining and instruction datasets and is not suitable for sensitive content without additional filtering.
- Users should not rely on this model for any domain where incorrect answers can cause harm.
Model tree for aquiffoo/neo-3-1B-A90M-Instruct
Base model: aquiffoo/neo-3-1B-A90M-Base