License: Apache 2.0 | Authors: Google DeepMind
Gemma is a family of open models built by Google DeepMind. Gemma 4 models are multimodal, handling text and image input (with audio supported on small models) and generating text output. This release includes open-weights models in both pre-trained and instruction-tuned variants. Gemma 4 features a context window of up to 256K tokens and maintains multilingual support in over 140 languages.
Featuring both Dense and Mixture-of-Experts (MoE) architectures, Gemma 4 is well-suited for tasks like text generation, coding, and reasoning. The models are available in four distinct sizes: E2B, E4B, 26B A4B, and 31B. Their diverse sizes make them deployable in environments ranging from high-end phones to laptops and servers, democratizing access to state-of-the-art AI.
Gemma 4 introduces key capability and architectural advancements:
Reasoning – All models in the family are designed as highly capable reasoners, with configurable thinking modes.
Extended Multimodality – Processes text, images (with variable aspect ratio and resolution support on all models), video, and audio (native on the E2B and E4B models).
Diverse & Efficient Architectures – Offers Dense and Mixture-of-Experts (MoE) variants of different sizes for scalable deployment.
Optimized for On-Device – Smaller models are specifically designed for efficient local execution on laptops and mobile devices.
Increased Context Window – The small models feature a 128K context window, while the medium models support 256K.
Enhanced Coding & Agentic Capabilities – Achieves notable improvements in coding benchmarks alongside native function-calling support, powering highly capable autonomous agents.
Native System Prompt Support – Gemma 4 introduces native support for the `system` role, enabling more structured and controllable conversations (see the sketch below).
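As an illustration, native system-role support means the instruction travels as its own chat turn in an OpenAI-style request. A minimal sketch; the endpoint URL and model identifier below are placeholders for whatever GGUF engine you run locally, not something defined by this card:

```python
# Illustrative sketch only: endpoint and model name are placeholders.
# Any OpenAI-compatible server (llama-server, LM Studio) accepts this shape.
import requests

payload = {
    "model": "gemma-4-E4B-it",  # placeholder identifier
    "messages": [
        # With native system-role support, the instruction is its own turn
        # instead of being prepended to the first user message.
        {"role": "system", "content": "Answer formally and keep replies short."},
        {"role": "user", "content": "What does the Q8_0 quantization format mean?"},
    ],
}
resp = requests.post("http://127.0.0.1:8080/v1/chat/completions", json=payload)
print(resp.json()["choices"][0]["message"]["content"])
```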
# Gemma-4-E4B-it LongRoPE 1M GGUF Q8_0
**Model with extended context window, based on `google/gemma-4-E4B-it` using the LongRoPE method.**
- 🧠 **Base model:** [google/gemma-4-E4B-it](https://huggingface.co/google/gemma-4-E4B-it)
- 📏 **Original context:** 128K tokens
- 🚀 **Extended context:** 1,048,576 tokens (1M) via **LongRoPE**
- 📦 **Format:** GGUF, quantization **Q8_0**
- ⚙️ **Compatibility:** LM Studio, llama.cpp, and other GGUF‑compatible engines
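If you prefer to fetch the files programmatically rather than through the browser, here is a minimal sketch with `huggingface_hub`, assuming the repo id `nikitayev/gemma-4-E4B-it-1M` and the filenames listed in the Files section below:

```python
# Sketch: download the GGUF weights and the multimodal projector from the Hub.
# Requires `pip install huggingface_hub`; adjust names if the repo layout changes.
from huggingface_hub import hf_hub_download

model_path = hf_hub_download(
    repo_id="nikitayev/gemma-4-E4B-it-1M",
    filename="google_gemma-4-E4B-it-q8_0.gguf",
)
mmproj_path = hf_hub_download(
    repo_id="nikitayev/gemma-4-E4B-it-1M",
    filename="mmproj-gemma-4-E4B-it-BF16.gguf",
)
print(model_path, mmproj_path)
```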
---
## 🔍 Description
This version of the model was obtained by converting the official instruction-tuned `google/gemma-4-E4B-it` into the universal GGUF format and then extending the context window using the **LongRoPE** technique.
The original context length is 128K tokens; after applying LongRoPE, the model can handle up to **1 million tokens** of continuous dialogue.
Quantization is performed in 8-bit `Q8_0` format, offering a good balance between quality and performance.
> ⚠️ **Important:** Extending the context by interpolating positional embeddings inevitably affects quality. The model has become somewhat “dumber” compared to the original, especially on complex multi-step reasoning tasks. However, with a proper set of parameters and Flash Attention disabled, it delivers satisfactory results on standard tasks.
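To see what the converted file actually claims about its context window, you can inspect the GGUF metadata directly. A minimal sketch with the `gguf` Python package; the exact metadata key names depend on the architecture prefix written during conversion, so treat them as assumptions:

```python
# Sketch: dump context-length and RoPE-related metadata from the GGUF file.
# Requires `pip install gguf`; key names vary with the architecture prefix.
from gguf import GGUFReader

reader = GGUFReader("google_gemma-4-E4B-it-q8_0.gguf")
for name, field in reader.fields.items():
    if "context_length" in name or "rope" in name:
        raw = field.parts[field.data[0]]  # numpy array holding the value
        print(f"{name}: {raw}")
```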
---
## 📊 Performance
Test system:
| Component | Specification |
|-----------|---------------|
| CPU | 2× Intel Xeon E5-2695 v4 @ 2.10GHz (AVX, AVX2) |
| RAM | 512 GB |
| GPU | NVIDIA GeForce RTX 3060 12 GB (CUDA 12.9, Compute Capability 8.6) |
**Inference speed:**
- **LM Studio 0.4.12 (Build 1)**: stable **~21 tokens/s**
- **llama.cpp (server, no CPU offload)**:
- Start: **34 tokens/s**
- End of context fill: drops to **18 tokens/s**
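These figures came from LM Studio's UI and llama.cpp's logs. As a rough cross-check, llama-server also reports its own timings in the response of its native `/completion` endpoint; the field names below match current llama.cpp builds but may change between versions:

```python
# Rough throughput check against a running llama-server instance.
import requests

resp = requests.post(
    "http://127.0.0.1:8080/completion",
    json={"prompt": "Write a short story about a lighthouse.", "n_predict": 256},
    timeout=600,
)
timings = resp.json().get("timings", {})  # reported by the server itself
print(f"prompt eval: {timings.get('prompt_per_second', '?')} tokens/s")
print(f"generation:  {timings.get('predicted_per_second', '?')} tokens/s")
```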
---
## 🧩 Recommended settings
### For LM Studio
Create a preset named, e.g., “BEST”, and set the following parameters:
```json
{
  "identifier": "@local:best",
  "name": "BEST",
  "changed": true,
  "operation": {
    "fields": [
      { "key": "llm.prediction.temperature", "value": 1.3 },
      { "key": "llm.prediction.contextOverflowPolicy", "value": "rollingWindow" },
      { "key": "llm.prediction.llama.cpuThreads", "value": 32 },
      { "key": "llm.prediction.topKSampling", "value": 500 },
      { "key": "llm.prediction.repeatPenalty", "value": { "checked": true, "value": 1 } },
      { "key": "llm.prediction.llama.presencePenalty", "value": { "checked": true, "value": 0 } },
      { "key": "llm.prediction.topPSampling", "value": { "checked": true, "value": 0.99 } },
      { "key": "llm.prediction.minPSampling", "value": { "checked": true, "value": 0.05 } }
    ]
  },
  "load": {
    "fields": []
  }
}
```
- Temperature is recommended in the range 1.0 to 1.3.
- Keep Flash Attention disabled; with it enabled, the model degrades more noticeably.
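To verify the preset end to end, you can query LM Studio's local OpenAI-compatible server (default `http://localhost:1234/v1`). The model identifier is whatever name LM Studio shows for the loaded GGUF, so treat it as a placeholder:

```python
# Smoke test against LM Studio's local server with sampling close to the preset.
import requests

resp = requests.post(
    "http://localhost:1234/v1/chat/completions",
    json={
        "model": "gemma-4-E4B-it-1M",  # placeholder: use the name LM Studio shows
        "messages": [
            {"role": "system", "content": "You are a concise assistant."},
            {"role": "user", "content": "Summarize the LongRoPE idea in two sentences."},
        ],
        "temperature": 1.3,
        "top_p": 0.99,
    },
)
print(resp.json()["choices"][0]["message"]["content"])
```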
### For llama.cpp (server)
An example launch command with which the model reliably solves logical “Einstein puzzles”:
"E:\LLM\llama.cpp\build\bin\llama-server.exe" \
-m "C:/LLM/Nikitayev/google_gemma-4-E4B-it/google_gemma-4-E4B-it-q8_0.gguf" \
--mmproj "C:/LLM/lmstudio-community/gemma-4-E4B-it-GGUF/mmproj-gemma-4-E4B-it-BF16.gguf" \
--host 127.0.0.1 --port 8080 \
--timeout 60000 --threads-http -1 \
--ctx-size 1048576 \
--flash-attn on --fit off --kv-offload \
--mmap --cont-batching --webui --jinja --embedding --metrics --slots --cache-prompt --mlock \
--reasoning-format auto \
--temp 0.75 --dynatemp-range 0.75 \
--top-k 10000 --top-p 0.99 --min-p 0.05 \
--xtc-probability 0 \
--repeat-penalty 1.0 --presence-penalty 0.0 --frequency-penalty 0.0 \
--dry-multiplier 0.0 \
--samplers "penalties;dry;top_n_sigma;top_k;typ_p;top_p;min_p;xtc;temperature" \
--n-predict 8192 --seed 0
> **Note:** The `--flash-attn on` flag is kept here because in some llama.cpp scenarios the combination of flash attention and these sampling parameters works better than in LM Studio. Try `--flash-attn off` if you experience instability.
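After launching, it is worth confirming that the server really allocated the 1M-token context. A small sketch against llama-server's `/props` endpoint; the response layout is taken from current llama.cpp builds and may differ in yours:

```python
# Sanity check that the server loaded the full extended context.
import requests

props = requests.get("http://127.0.0.1:8080/props").json()
n_ctx = props.get("default_generation_settings", {}).get("n_ctx")
print(f"server context size: {n_ctx}")  # expect 1048576 with --ctx-size 1048576
```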
---
## 🧪 Observations and Conclusions
- Extending the context with LongRoPE leads to a noticeable but not critical drop in the model’s “intelligence”.
- With Flash Attention disabled, the quality of answers on standard conversational tasks remains acceptable.
- On complex logical tasks, the model remains capable with carefully chosen sampling parameters (see examples above).
- Using GPU offloading is critical; CPU-only inference drops speed dramatically. The RTX 3060 with 12 GB allows loading all model weights into VRAM.
---
## 📁 Files
- `google_gemma-4-E4B-it-q8_0.gguf` – main GGUF Q8_0 model with extended context.
- `mmproj-gemma-4-E4B-it-BF16.gguf` – multimodal embedding projector (original, BF16), required for the full multimodal pipeline.
---
## 👤 Author
Nikitayev
📧 nikitayev@mail.ru
📧 nikitayev1979@gmail.com
This model was created for research purposes and is distributed under the terms of the original Google Gemma model license.