Text Generation
Transformers
Safetensors
qwen2
conversational
text-generation-inference
4-bit precision
bitsandbytes
Instructions to use neuralnets/fractal_r1_4bit_q with libraries, inference providers, notebooks, and local apps. Follow these links to get started.
- Libraries
- Transformers
How to use neuralnets/fractal_r1_4bit_q with Transformers:
# Use a pipeline as a high-level helper
from transformers import pipeline

pipe = pipeline("text-generation", model="neuralnets/fractal_r1_4bit_q")
messages = [
    {"role": "user", "content": "Who are you?"},
]
pipe(messages)

# Load model directly
from transformers import AutoTokenizer, AutoModelForCausalLM

tokenizer = AutoTokenizer.from_pretrained("neuralnets/fractal_r1_4bit_q")
model = AutoModelForCausalLM.from_pretrained("neuralnets/fractal_r1_4bit_q")
messages = [
    {"role": "user", "content": "Who are you?"},
]
inputs = tokenizer.apply_chat_template(
    messages,
    add_generation_prompt=True,
    tokenize=True,
    return_dict=True,
    return_tensors="pt",
).to(model.device)

outputs = model.generate(**inputs, max_new_tokens=40)
print(tokenizer.decode(outputs[0][inputs["input_ids"].shape[-1]:]))
- Notebooks
- Google Colab
- Kaggle
- Local Apps Settings
- vLLM
How to use neuralnets/fractal_r1_4bit_q with vLLM:
Install from pip and serve model
# Install vLLM from pip:
pip install vllm
# Start the vLLM server:
vllm serve "neuralnets/fractal_r1_4bit_q"
# Call the server using curl (OpenAI-compatible API):
curl -X POST "http://localhost:8000/v1/chat/completions" \
  -H "Content-Type: application/json" \
  --data '{
    "model": "neuralnets/fractal_r1_4bit_q",
    "messages": [
      {
        "role": "user",
        "content": "What is the capital of France?"
      }
    ]
  }'
Use Docker
docker model run hf.co/neuralnets/fractal_r1_4bit_q
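The curl call in the vLLM instructions above can also be made from Python. The following is a minimal sketch using only the standard library; the helper names `build_chat_request` and `send_chat_request` are hypothetical, and the default URL assumes the server is listening on localhost port 8000 as in the curl example.

```python
import json
import urllib.request


def build_chat_request(prompt, base_url="http://localhost:8000",
                       model="neuralnets/fractal_r1_4bit_q"):
    """Build (but do not send) a POST request for /v1/chat/completions."""
    payload = {
        "model": model,
        "messages": [{"role": "user", "content": prompt}],
    }
    return urllib.request.Request(
        url=f"{base_url}/v1/chat/completions",
        data=json.dumps(payload).encode("utf-8"),
        headers={"Content-Type": "application/json"},
        method="POST",
    )


def send_chat_request(request, timeout=60):
    """Send the request and return the decoded JSON response body."""
    with urllib.request.urlopen(request, timeout=timeout) as response:
        return json.loads(response.read().decode("utf-8"))
```

With a server running, `send_chat_request(build_chat_request("What is the capital of France?"))` returns the parsed response; in the OpenAI-compatible format the generated text sits under `choices[0]["message"]["content"]`. Passing `base_url="http://localhost:30000"` targets the port used in the SGLang instructions instead.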
- SGLang
How to use neuralnets/fractal_r1_4bit_q with SGLang:
Install from pip and serve model
# Install SGLang from pip:
pip install sglang
# Start the SGLang server:
python3 -m sglang.launch_server \
  --model-path "neuralnets/fractal_r1_4bit_q" \
  --host 0.0.0.0 \
  --port 30000
# Call the server using curl (OpenAI-compatible API):
curl -X POST "http://localhost:30000/v1/chat/completions" \
  -H "Content-Type: application/json" \
  --data '{
    "model": "neuralnets/fractal_r1_4bit_q",
    "messages": [
      {
        "role": "user",
        "content": "What is the capital of France?"
      }
    ]
  }'
Use Docker images
docker run --gpus all \
  --shm-size 32g \
  -p 30000:30000 \
  -v ~/.cache/huggingface:/root/.cache/huggingface \
  --env "HF_TOKEN=<secret>" \
  --ipc=host \
  lmsysorg/sglang:latest \
  python3 -m sglang.launch_server \
    --model-path "neuralnets/fractal_r1_4bit_q" \
    --host 0.0.0.0 \
    --port 30000
# Call the server using curl (OpenAI-compatible API):
curl -X POST "http://localhost:30000/v1/chat/completions" \
  -H "Content-Type: application/json" \
  --data '{
    "model": "neuralnets/fractal_r1_4bit_q",
    "messages": [
      {
        "role": "user",
        "content": "What is the capital of France?"
      }
    ]
  }'
- Docker Model Runner
How to use neuralnets/fractal_r1_4bit_q with Docker Model Runner:
docker model run hf.co/neuralnets/fractal_r1_4bit_q
Model is not supported by TGI
#1
by maheshbabu9199 - opened
When trying to load this model using TGI, I get the following error:
RuntimeError: [FT][ERROR] Invalid shape for quantized tensor. Number of rows of quantized matrix must be a multiple of 16 Assertion fail: /build/source/cutlass_kernels/cutlass_preprocessors.cc:164
text-generation-inference | 2025-05-27T08:01:55.437641Z ERROR shard-manager: text_generation_launcher: Shard complete standard error output.
Has anyone gotten this model working with TGI?
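The assertion in the log states that the number of rows of a quantized matrix must be a multiple of 16. As a rough diagnostic, a set of weight shapes can be checked against that rule. The sketch below is only an illustration: the function name `rows_not_multiple_of` is hypothetical, the example shapes are made up rather than read from this model, and treating axis 0 as the "rows" the kernel checks is an assumption.

```python
def rows_not_multiple_of(shapes, multiple=16, row_axis=0):
    """Return {name: shape} for tensors whose row count is not divisible by `multiple`."""
    return {
        name: shape
        for name, shape in shapes.items()
        if shape[row_axis] % multiple != 0
    }


# Hypothetical shapes for illustration only; not taken from this model.
example_shapes = {
    "layer.a.weight": (4096, 4096),
    "layer.b.weight": (1000, 4096),  # 1000 % 16 == 8, so this one is flagged
}
print(rows_not_multiple_of(example_shapes))
```

For a model loaded with PyTorch, the input could be gathered with `{name: tuple(t.shape) for name, t in model.state_dict().items() if t.ndim == 2}`. Whether the tensors flagged this way are the ones the TGI kernel rejects is not confirmed by the log.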
maheshbabu9199 changed discussion title from model is supported by tgi to Model is supporting by TGI
maheshbabu9199 changed discussion title from Model is supporting by TGI to Model is not supporting by TGI