What makes the 'web' model different?

#6
by nlisac - opened

I am trying to learn about finetuning models. I would love to finetune a Gemma model; however, I am looking to use it in a web application, which requires special consideration of the type of model I use.

In your version of this model, for web, what are the special settings or operations used - I assume in conversion - that allow for a web-friendly model vs. a non-web-friendly model? Thank you!

Hi @nlisac ,

Welcome to Gemma models, and thanks for reaching out to us. The most direct indicator of a web-friendly model is its file extension and the runtime environment it's designed for.

Standard Model: Typically distributed as PyTorch (.pth, .safetensors) or Hugging Face Transformers checkpoints, which are designed for powerful server-side GPUs.

Web-Friendly Model: The suffix litert-lm in the model name stands for LiteRT-LM, which is Google's dedicated framework for on-device or edge inference.

  1. The model is converted into a proprietary, highly optimized format like .litertlm (or sometimes a .task file bundle containing a mix of smaller optimized TFLite files).
  2. This format is designed to be used with the MediaPipe LLM Inference API or the underlying LiteRT-LM runtime, which can run efficiently within a web worker thread using WebAssembly (Wasm).
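
For concreteness, here is a minimal sketch of how such a bundle is typically loaded in the browser through the MediaPipe LLM Inference API. The CDN path, model location, and generation parameters below are illustrative placeholders, not values taken from this repository.

```typescript
// Minimal sketch: running a .litertlm/.task bundle in the browser via the
// MediaPipe LLM Inference API (@mediapipe/tasks-genai).
// The model path and generation settings are placeholders for illustration.
import { FilesetResolver, LlmInference } from '@mediapipe/tasks-genai';

async function runGemmaInBrowser(prompt: string): Promise<string> {
  // Resolve the WebAssembly assets that back the GenAI tasks.
  const genAiFileset = await FilesetResolver.forGenAiTasks(
    'https://cdn.jsdelivr.net/npm/@mediapipe/tasks-genai/wasm'
  );

  // Create the LLM task from a locally hosted model file (placeholder path).
  const llm = await LlmInference.createFromOptions(genAiFileset, {
    baseOptions: { modelAssetPath: '/models/gemma-3n-E2B-it-int4-Web.litertlm' },
    maxTokens: 1024, // illustrative settings, tune for your use case
    topK: 40,
    temperature: 0.8,
  });

  // Generate a single text completion for the prompt.
  return llm.generateResponse(prompt);
}
```

In a real app you would typically run this inside a web worker so that the model download and inference do not block the UI thread.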

The base Gemma 3n architecture already includes features that make it inherently more web-friendly before conversion:
1. Selective Parameter Activation (MatFormer): Gemma 3n uses the Matryoshka Transformer architecture, which allows for the selective activation of only a subset of the total parameters based on the task or device resources. This reduces the effective parameter count (E2B) and the computational cost per request.
2. Per-Layer Embedding (PLE) Caching: This technique allows key embedding parameters to be cached to fast, local storage, reducing runtime memory requirements.
3. On-Device Focus: The Gemma family, particularly the 3n variants, is specifically engineered by Google for efficient execution on low-resource devices like mobile phones and, by extension, web browsers.
Thanks.

@BalakrishnaCh

Thanks for the detailed answer.

A follow up question from my side:
What's the difference then between gemma-3n-E2B-it-int4-Web.litertlm and gemma-3n-E2B-it-int4.litertlm? Both models are already in the litertlm format.

Could you please clarify the sources from which you are referencing these two models?

Thanks.

@BalakrishnaCh
From this repository (the Files tab):
https://huggingface.co/google/gemma-3n-E2B-it-litert-lm/tree/main

He means what makes the models technically different. They are both litertlm models. Nothing in your answer described what specifically makes a web-performant litertlm model different from a non-web litertlm model.

I, too, have the same question. There are two models, one is specially marked as web. What's different about it?

Possibly related: on the 1B model page, there are also separate web and Android models. The Android-specific models were benchmarked on Samsung devices, while the web-specific model was run on a MacBook Pro and is much more performant. My guess is they use different block sizes? Waiting for a proper response.

Hi all,

The core distinction is that the gemma-3n-E2B-it-int4-Web.litertlm model has been specifically converted and optimized for execution in a web browser, whereas the other litertlm file is a more general-purpose on-device model, likely optimized for mobile/edge devices.

Both models are in the litertlm format, which is Google's optimized format for on-device inference with the LiteRT-LM runtime. However, the one marked -Web has undergone an additional optimization step to make it more performant and efficient for the web environment, which relies on WebAssembly (Wasm) and WebGPU.
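
Because of that WebGPU dependency, a web app would usually feature-detect WebGPU before fetching the multi-gigabyte model file. A minimal sketch, assuming a standard browser environment; the fallback behaviour and model path are placeholders:

```typescript
// Minimal sketch: check WebGPU availability before downloading the -Web model.
// The cast to `any` avoids requiring WebGPU type definitions (e.g. @webgpu/types).
async function hasUsableWebGpu(): Promise<boolean> {
  const gpu = (navigator as any).gpu;
  if (!gpu) return false;                      // WebGPU API not exposed by this browser
  const adapter = await gpu.requestAdapter();  // may resolve to null on unsupported hardware
  return adapter !== null;
}

// Placeholder path and fallback behaviour, for illustration only.
async function chooseModelUrl(): Promise<string> {
  if (await hasUsableWebGpu()) {
    return '/models/gemma-3n-E2B-it-int4-Web.litertlm';
  }
  throw new Error('WebGPU unavailable: in-browser LLM inference is not supported on this device.');
}
```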

As seen in the repository's file list, the Web.litertlm model is smaller (3.04 GB) than the standard litertlm model (3.39 GB). This reduction comes from platform-specific optimizations, such as different prefill signature lengths and changes to the KV cache layout, all tailored to run more efficiently on WebGPU.

  1. gemma-3n-E2B-it-int4-Web.litertlm --> 3.04 GB --> Optimized for Web (WebGPU/Wasm)
  2. gemma-3n-E2B-it-int4.litertlm --> 3.39 GB --> General On-Device (e.g., Android)

While both are litertlm models, the -Web suffix indicates that the model has been purpose-built and optimized for the unique constraints and capabilities of a web browser runtime, making it the correct and recommended choice for any web-based application.

Thanks.

Google org

One small clarification: LiteRT-LM doesn't run on web at all yet, so the LiteRT team's web LLM models actually use a completely different LLM runtime (and are therefore converted into a totally different format under the hood; the file extensions for our "-web.*" models are a bit of a misnomer in that respect).

This can potentially make them more performant and give them additional functionality (like lower CPU memory usage, better fp16 overflow handling, and allowing the user to specify arbitrary context sizes), since these models are "hand-crafted" rather than automatically generated through a more sophisticated conversion process. But these models will not generally work with our other LiteRT tools and systems (e.g., Model Explorer won't work on any web LLM files). So while some of the earlier "hand-crafted" models also run quite well on Android, desktop, or iOS, that's not a use case we're supporting; the focus for these "-web" models at present is the browser. In fact, moving forward, the team plans to eventually use LiteRT-LM for web as well.

For those curious about the web side specifically, I gave some more details in my recent WebAI Summit talk.
