Model Overview

  • Model Architecture: Phi3ForCausalLM
    • Input: Text
    • Output: Text
  • Model Optimizations:
    • Activation quantization: FP8
    • Weight quantization: FP8
  • Intended Use Cases: This model is designed to accelerate research on language models and to serve as a building block for generative AI powered features. It targets general-purpose AI systems and applications (primarily in English) that require:
    1. Memory/compute constrained environments.
    2. Latency bound scenarios.
    3. Math reasoning and logic.
  • Release Date: 01/26/2026
  • Version: 1.0
  • Model Developers: Red Hat

Model Optimizations

This model was obtained by quantizing the activations and weights of Phi-4-reasoning to the FP8 data type. This optimization reduces the number of bits used to represent weights and activations from 16 to 8, reducing GPU memory requirements (by approximately 50%) and increasing matrix-multiply compute throughput (by approximately 2x). Weight quantization also reduces disk size requirements by approximately 50%.
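The memory estimate above can be checked with back-of-the-envelope arithmetic. The sketch below assumes a hypothetical count of 14 billion quantized parameters (an illustrative figure, not read from the model's configuration) and ignores everything that does not shrink with the weights: layers kept in 16-bit precision, the quantization scales, the KV cache, and runtime activations. Real savings are therefore somewhat below the ideal 50%.

```python
def weight_bytes(num_params: int, bits_per_param: int) -> int:
    """Bytes needed to store num_params values at the given bit width."""
    return num_params * bits_per_param // 8

# Hypothetical parameter count, used for illustration only.
NUM_PARAMS = 14_000_000_000

bf16_gb = weight_bytes(NUM_PARAMS, 16) / 1e9
fp8_gb = weight_bytes(NUM_PARAMS, 8) / 1e9
reduction = 1 - fp8_gb / bf16_gb

print(f"16-bit weights: {bf16_gb:.1f} GB")
print(f"FP8 weights:    {fp8_gb:.1f} GB")
print(f"Reduction:      {reduction:.0%}")
```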

Only the weights and activations of the linear operators within transformer blocks are quantized. Weights are quantized with a symmetric static per-channel scheme, whereas activations are quantized with a symmetric dynamic per-token scheme. The llm-compressor library is used for quantization.
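To make the two schemes concrete, here is a minimal pure-Python sketch of how the scales could be derived. It is not the llm-compressor implementation: the function names are hypothetical, the constant 448 assumes the FP8 E4M3 format, and the final rounding to the FP8 grid is omitted. "Symmetric" means the scale comes from the maximum absolute value, with no zero-point. "Static" weight scales are computed once offline, while "dynamic" activation scales are recomputed for every token at inference time.

```python
FP8_E4M3_MAX = 448.0  # largest finite magnitude in FP8 E4M3 (assumed format)

def row_scales(rows):
    """One symmetric scale per row: max |value| mapped to the FP8 maximum."""
    return [max(max(abs(v) for v in row), 1e-12) / FP8_E4M3_MAX for row in rows]

def per_channel_weight_scales(weight):
    """Static scheme: one scale per output channel (row), computed once offline."""
    return row_scales(weight)

def per_token_activation_scales(activations):
    """Dynamic scheme: one scale per token (row), recomputed at runtime."""
    return row_scales(activations)

def scale_and_clamp(rows, scales):
    """Divide each row by its scale and clamp to the FP8 range (rounding omitted)."""
    return [
        [max(-FP8_E4M3_MAX, min(FP8_E4M3_MAX, v / s)) for v in row]
        for row, s in zip(rows, scales)
    ]

# Toy weight matrix with 2 output channels and 3 input features.
weight = [[0.5, -2.0, 1.0], [0.1, 0.05, -0.2]]
w_scales = per_channel_weight_scales(weight)
w_scaled = scale_and_clamp(weight, w_scales)

# Toy activations for 2 tokens.
tokens = [[3.0, -1.5, 0.75], [0.02, 0.01, -0.04]]
a_scales = per_token_activation_scales(tokens)

print("weight scales:", w_scales)
print("activation scales:", a_scales)
```

Because each row gets its own scale, a channel or token with small values keeps its resolution even when another row contains large outliers.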

Deployment

This model can be deployed efficiently using the vLLM backend, as shown in the example below.

vllm serve RedHatAI/Phi-4-reasoning-FP8-dynamic --reasoning-parser deepseek_r1

from openai import OpenAI
# Set OpenAI's API key and API base to use vLLM's API server.
openai_api_key = "EMPTY"
openai_api_base = "http://localhost:8000/v1"

client = OpenAI(
    api_key=openai_api_key,
    base_url=openai_api_base,
)

generated_text = client.chat.completions.create(
    model="RedHatAI/Phi-4-reasoning-FP8-dynamic",
    messages=[
        {"role": "user", "content": "Give me a short introduction to large language model."},
    ],
)
print(generated_text.choices[0].message.content)
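Because the server is started with a reasoning parser, the reasoning trace may be returned separately from the final answer. The sketch below assumes the trace is exposed on the message as `reasoning_content`; that attribute name is an assumption and may differ across vLLM versions. The helper `split_reasoning` is hypothetical, and a stand-in object replaces `generated_text.choices[0].message` so the example runs without a server.

```python
from types import SimpleNamespace

def split_reasoning(message):
    """Return (reasoning, answer) from a chat completion message.

    Assumes the reasoning trace, if present, is exposed as
    `reasoning_content`; returns None for it when the attribute is absent.
    """
    reasoning = getattr(message, "reasoning_content", None)
    return reasoning, message.content

# Stand-in for `generated_text.choices[0].message` (illustrative values only).
example = SimpleNamespace(reasoning_content="2 + 2 = 4.", content="The answer is 4.")

reasoning, answer = split_reasoning(example)
print("Reasoning:", reasoning)
print("Answer:", answer)
```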

Creation

Creation details

This model was created with [llm-compressor](https://github.com/vllm-project/llm-compressor) by running the code snippet below.

from transformers import AutoModelForCausalLM, AutoTokenizer
from llmcompressor.modifiers.quantization import QuantizationModifier
from llmcompressor.transformers import oneshot

# Load model
model_stub = "microsoft/Phi-4-reasoning"
model_name = model_stub.split("/")[-1]

tokenizer = AutoTokenizer.from_pretrained(model_stub)

model = AutoModelForCausalLM.from_pretrained(
    model_stub,
    device_map="auto",
    torch_dtype="auto",
)

# Configure the quantization algorithm and scheme
recipe = QuantizationModifier(
    targets="Linear",
    scheme="FP8_dynamic",
    ignore=["lm_head"],
)

# Apply quantization
oneshot(
    model=model,
    recipe=recipe,
)

# Save to disk in compressed-tensors format
save_path = model_name + "-FP8-dynamic"
model.save_pretrained(save_path)
tokenizer.save_pretrained(save_path)
print(f"Model and tokenizer saved to: {save_path}")

Evaluation

The model was evaluated on the AIME25, GPQA Diamond, and Math 500 benchmarks using lighteval and vLLM.

Evaluation commands

litellm_config.yaml

model_parameters:
  provider: "hosted_vllm"
  model_name: "hosted_vllm/RedHatAI/Phi-4-reasoning-FP8-dynamic"
  base_url: "http://0.0.0.0:8000/v1"
  api_key: ""
  timeout: 1200
  concurrent_requests: 64
  generation_parameters:
    temperature: 0.8
    top_k: 50
    top_p: 0.95
    max_new_tokens: 24000

lighteval endpoint litellm litellm_config.yaml \
    "gpqa:diamond|0,math_500|0,aime25|0" \
    --output-dir phi4_reasoning_fp8_dynamic \
    --save-details

Accuracy

Benchmark      Phi-4-reasoning   Phi-4-reasoning FP8-dynamic (this model)   Recovery
AIME25         61.25             64.58                                      105.4%
GPQA Diamond   64.65             66.50                                      102.9%
Math 500       90.01             88.60                                      98.4%
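
The Recovery column is consistent with the quantized score divided by the baseline score, expressed as a percentage. That definition is inferred from the table values rather than stated in the card, so treat the sketch below as an assumption-based check.

```python
def recovery(baseline: float, quantized: float) -> float:
    """Quantized score as a percentage of the baseline score."""
    return 100.0 * quantized / baseline

# (baseline, quantized) scores copied from the accuracy table.
scores = {
    "AIME25": (61.25, 64.58),
    "GPQA Diamond": (64.65, 66.50),
    "Math 500": (90.01, 88.60),
}

for name, (base, quant) in scores.items():
    print(f"{name}: {recovery(base, quant):.1f}%")
```

Values above 100% mean the quantized model scored higher than the baseline on that run.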