Hugging Face | GitHub | Launch Blog | Documentation
License: Apache 2.0 | Authors: Google DeepMind

Gemma is a family of open models built by Google DeepMind. Gemma 4 models are multimodal, handling text and image input (with audio supported on small models) and generating text output. This release includes open-weights models in both pre-trained and instruction-tuned variants. Gemma 4 features a context window of up to 256K tokens and maintains multilingual support in over 140 languages.

Featuring both Dense and Mixture-of-Experts (MoE) architectures, Gemma 4 is well-suited for tasks like text generation, coding, and reasoning. The models are available in four distinct sizes: E2B, E4B, 26B A4B, and 31B. Their diverse sizes make them deployable in environments ranging from high-end phones to laptops and servers, democratizing access to state-of-the-art AI.

Gemma 4 introduces key capability and architectural advancements:

  • Reasoning – All models in the family are designed as highly capable reasoners, with configurable thinking modes.

  • Extended Multimodality – Processes text, images (with variable aspect-ratio and resolution support on all models), video, and audio (handled natively by the E2B and E4B models).

  • Diverse & Efficient Architectures – Offers Dense and Mixture-of-Experts (MoE) variants of different sizes for scalable deployment.

  • Optimized for On-Device – Smaller models are specifically designed for efficient local execution on laptops and mobile devices.

  • Increased Context Window – The small models feature a 128K context window, while the medium models support 256K.

  • Enhanced Coding & Agentic Capabilities – Achieves notable improvements in coding benchmarks alongside native function-calling support, powering highly capable autonomous agents.

  • Native System Prompt Support – Gemma 4 introduces native support for the system role, enabling more structured and controllable conversations.

# Gemma-4-E4B-it LongRoPE 1M GGUF Q8_0

**Model with extended context window, based on `google/gemma-4-E4B-it` using the LongRoPE method.**

- 🧠 **Base model:** [google/gemma-4-E4B-it](https://huggingface.co/google/gemma-4-E4B-it)  
- 📏 **Original context:** 128K tokens  
- 🚀 **Extended context:** 1,048,576 tokens (1M) via **LongRoPE**  
- 📦 **Format:** GGUF, quantization **Q8_0**  
- ⚙️ **Compatibility:** LM Studio, llama.cpp, and other GGUF‑compatible engines

---

## 🔍 Description

This version of the model was obtained by converting the official instruction-tuned `google/gemma-4-E4B-it` into the universal GGUF format and then extending the context window with the **LongRoPE** technique.  
The original context length is 128K tokens; after applying LongRoPE, the model can handle up to **1 million tokens** of continuous dialogue.  
The weights are quantized to 8-bit **`Q8_0`**, offering a good balance between quality and performance.
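
For reference, below is a minimal sketch of the GGUF conversion and Q8_0 quantization steps using llama.cpp's standard tooling. The paths and file names are illustrative, and the LongRoPE positional-interpolation step itself (rescaling the model's RoPE for the longer context) was performed separately and is not reproduced here.

```bash
# Hedged sketch of a typical llama.cpp conversion + quantization flow.
# Paths are illustrative; the LongRoPE context extension (RoPE rescaling)
# was applied separately and is not shown here.

# 1. Convert the Hugging Face checkpoint to a full-precision GGUF file
python convert_hf_to_gguf.py ./gemma-4-E4B-it \
  --outfile gemma-4-E4B-it-f16.gguf --outtype f16

# 2. Quantize the GGUF file to 8-bit Q8_0
./llama-quantize gemma-4-E4B-it-f16.gguf google_gemma-4-E4B-it-q8_0.gguf Q8_0
```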

> ⚠️ **Important:** Extending the context by interpolating positional embeddings inevitably affects quality. The model has become somewhat “dumber” compared to the original, especially on complex multi-step reasoning tasks. However, with a proper set of parameters and Flash Attention disabled, it delivers satisfactory results on standard tasks.

---

## 📊 Performance

Test system:

| Component | Specification |
|-----------|---------------|
| CPU | 2× Intel Xeon E5-2695 v4 @ 2.10GHz (AVX, AVX2) |
| RAM | 512 GB |
| GPU | NVIDIA GeForce RTX 3060 12 GB (CUDA 12.9, Compute Capability 8.6) |

**Inference speed:**

- **LM Studio 0.4.12 (Build 1)**: stable **~21 tokens/s**  
- **llama.cpp (server, no CPU offload)**:  
  - Start: **34 tokens/s**  
  - End of context fill: drops to **18 tokens/s**
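
To reproduce these numbers on your own hardware, a minimal sketch using llama.cpp's `llama-bench` tool is shown below; the model path, token counts, and GPU layer count are illustrative and should be adjusted to your setup.

```bash
# Hedged sketch: measuring prompt-processing and generation throughput with llama-bench.
# The model path, -p/-n token counts, and -ngl value are illustrative; adjust them
# to your build, file location, and available VRAM.
./llama-bench -m google_gemma-4-E4B-it-q8_0.gguf -p 512 -n 128 -ngl 99
```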

---

## 🧩 Recommended settings

### For LM Studio

Create a preset named, e.g., “BEST”, and set the following parameters:

```json
{
  "identifier": "@local:best",
  "name": "BEST",
  "changed": true,
  "operation": {
    "fields": [
      { "key": "llm.prediction.temperature", "value": 1.3 },
      { "key": "llm.prediction.contextOverflowPolicy", "value": "rollingWindow" },
      { "key": "llm.prediction.llama.cpuThreads", "value": 32 },
      { "key": "llm.prediction.topKSampling", "value": 500 },
      { "key": "llm.prediction.repeatPenalty", "value": { "checked": true, "value": 1 } },
      { "key": "llm.prediction.llama.presencePenalty", "value": { "checked": true, "value": 0 } },
      { "key": "llm.prediction.topPSampling", "value": { "checked": true, "value": 0.99 } },
      { "key": "llm.prediction.minPSampling", "value": { "checked": true, "value": 0.05 } }
    ]
  },
  "load": {
    "fields": []
  }
}
```

- Temperature is recommended in the range 1.0–1.3.
- Keep Flash Attention disabled — with it, the model degrades more.

### For llama.cpp (server)

An example launch command with which the model reliably solves "Einstein"-style logic puzzles:

"E:\LLM\llama.cpp\build\bin\llama-server.exe" \
  -m "C:/LLM/Nikitayev/google_gemma-4-E4B-it/google_gemma-4-E4B-it-q8_0.gguf" \
  --mmproj "C:/LLM/lmstudio-community/gemma-4-E4B-it-GGUF/mmproj-gemma-4-E4B-it-BF16.gguf" \
  --host 127.0.0.1 --port 8080 \
  --timeout 60000 --threads-http -1 \
  --ctx-size 1048576 \
  --flash-attn on --fit off --kv-offload \
  --mmap --cont-batching --webui --jinja --embedding --metrics --slots --cache-prompt --mlock \
  --reasoning-format auto \
  --temp 0.75 --dynatemp-range 0.75 \
  --top-k 10000 --top-p 0.99 --min-p 0.05 \
  --xtc-probability 0 \
  --repeat-penalty 1.0 --presence-penalty 0.0 --frequency-penalty 0.0 \
  --dry-multiplier 0.0 \
  --samplers "penalties;dry;top_n_sigma;top_k;typ_p;top_p;min_p;xtc;temperature" \
  --n-predict 8192 --seed 0

> **Note:** The `--flash-attn on` flag is left here because in some llama.cpp scenarios the combination of flash attention + sampling parameters works better than in LM Studio. Try `--flash-attn off` if you experience instability.
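
As a quick usage check, the sketch below queries the server started above through its OpenAI-compatible chat endpoint, reusing the recommended temperature/top-p/min-p values; the system and user messages and the `max_tokens` value are illustrative.

```bash
# Hedged usage sketch: querying the llama-server instance launched above via its
# OpenAI-compatible /v1/chat/completions endpoint. The messages and max_tokens
# value are illustrative; the sampling values mirror the launch flags.
curl http://127.0.0.1:8080/v1/chat/completions \
  -H "Content-Type: application/json" \
  -d '{
        "messages": [
          {"role": "system", "content": "You are a careful, step-by-step logical reasoner."},
          {"role": "user", "content": "Five houses, five owners, five pets: who owns the fish?"}
        ],
        "temperature": 0.75,
        "top_p": 0.99,
        "min_p": 0.05,
        "max_tokens": 8192
      }'
```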


---

## 🧪 Observations and Conclusions

- Extending the context with LongRoPE leads to a noticeable but not critical drop in the model's "intelligence".
- With Flash Attention disabled, the quality of answers on standard conversational tasks remains acceptable.
- On complex logical tasks, the model remains capable with carefully chosen sampling parameters (see the examples above).
- Using GPU offloading is critical; CPU-only inference drops speed dramatically. The RTX 3060 with 12 GB allows loading all model weights into VRAM.

---

## 📁 Files

- `google_gemma-4-E4B-it-q8_0.gguf` – main GGUF Q8_0 model with extended context.
- `mmproj-gemma-4-E4B-it-BF16.gguf` – multimodal embedding projector (original, BF16), required for the full pipeline.

---

## 👤 Author

**Nikitayev**  
📧 nikitayev@mail.ru  
📧 nikitayev1979@gmail.com

This model was created for research purposes. Distributed under the terms of the original Google Gemma model license.
