How to use philschmid/t5-11b-sharded with Transformers:
# Use a pipeline as a high-level helper
from transformers import pipeline
pipe = pipeline("text2text-generation", model="philschmid/t5-11b-sharded")
# Load model directly
from transformers import AutoTokenizer, AutoModelForSeq2SeqLM
tokenizer = AutoTokenizer.from_pretrained("philschmid/t5-11b-sharded")
model = AutoModelForSeq2SeqLM.from_pretrained("philschmid/t5-11b-sharded")
How to use philschmid/t5-11b-sharded with vLLM:
# Install vLLM from pip:
pip install vllm
# Start the vLLM server:
vllm serve "philschmid/t5-11b-sharded"
# Call the server using curl (OpenAI-compatible API):
curl -X POST "http://localhost:8000/v1/completions" \
-H "Content-Type: application/json" \
--data '{
"model": "philschmid/t5-11b-sharded",
"prompt": "Once upon a time,",
"max_tokens": 512,
"temperature": 0.5
}'
# Or run with Docker:
docker model run hf.co/philschmid/t5-11b-sharded
How to use philschmid/t5-11b-sharded with SGLang:
# Install SGLang from pip:
pip install sglang
# Start the SGLang server:
python3 -m sglang.launch_server \
--model-path "philschmid/t5-11b-sharded" \
--host 0.0.0.0 \
--port 30000
# Call the server using curl (OpenAI-compatible API):
curl -X POST "http://localhost:30000/v1/completions" \
-H "Content-Type: application/json" \
--data '{
"model": "philschmid/t5-11b-sharded",
"prompt": "Once upon a time,",
"max_tokens": 512,
"temperature": 0.5
}'
# Or start the SGLang server with Docker:
docker run --gpus all \
--shm-size 32g \
-p 30000:30000 \
-v ~/.cache/huggingface:/root/.cache/huggingface \
--env "HF_TOKEN=<secret>" \
--ipc=host \
lmsysorg/sglang:latest \
python3 -m sglang.launch_server \
--model-path "philschmid/t5-11b-sharded" \
--host 0.0.0.0 \
--port 30000
# Call the server using curl (OpenAI-compatible API):
curl -X POST "http://localhost:30000/v1/completions" \
-H "Content-Type: application/json" \
--data '{
"model": "philschmid/t5-11b-sharded",
"prompt": "Once upon a time,",
"max_tokens": 512,
"temperature": 0.5
}'
How to use philschmid/t5-11b-sharded with Docker Model Runner:
docker model run hf.co/philschmid/t5-11b-sharded
This is a fork of t5-11b that implements a custom handler.py as an example of how to use t5-11b with inference-endpoints on a single NVIDIA T4.
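For orientation, a custom handler for Inference Endpoints is a handler.py file that exposes an EndpointHandler class with an __init__(path) and a __call__(data) method. The block below is a minimal sketch of that shape, not the handler shipped in this repository: the generate_fn hook is a hypothetical addition so the request logic can be tried without loading the 11B checkpoint, the [{"generated_text": ...}] response format is an assumption, and the loading options needed to fit the model on a single T4 are left out.

```python
from typing import Any, Callable, Dict, List, Optional


class EndpointHandler:
    """Sketch of a custom Inference Endpoints handler for a seq2seq model."""

    def __init__(
        self,
        path: str = "",
        generate_fn: Optional[Callable[..., List[str]]] = None,
    ):
        # `generate_fn` is a hypothetical hook (not part of the handler
        # convention) so the request logic can run without the real model.
        self.generate_fn = generate_fn or self._load_generate_fn(path)

    @staticmethod
    def _load_generate_fn(path: str) -> Callable[..., List[str]]:
        # Imported lazily so the class can be used with a stub generate_fn.
        # Memory-saving loading options are deliberately omitted here.
        from transformers import AutoModelForSeq2SeqLM, AutoTokenizer

        tokenizer = AutoTokenizer.from_pretrained(path)
        model = AutoModelForSeq2SeqLM.from_pretrained(path)

        def generate(text: str, **parameters: Any) -> List[str]:
            inputs = tokenizer(text, return_tensors="pt").to(model.device)
            output_ids = model.generate(**inputs, **parameters)
            return tokenizer.batch_decode(output_ids, skip_special_tokens=True)

        return generate

    def __call__(self, data: Dict[str, Any]) -> List[Dict[str, str]]:
        # The request body carries the prompt under "inputs" and optional
        # generation arguments under "parameters".
        inputs = data["inputs"]
        parameters = data.get("parameters") or {}
        texts = self.generate_fn(inputs, **parameters)
        return [{"generated_text": text} for text in texts]
```

With a stub such as generate_fn=lambda text, **params: [text], the handler can be called locally with the same payload dictionaries that are sent to the endpoint.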
Hugging Face Inference Endpoints can be used with an HTTP client in any language. We will use Python and the requests library to send our requests (make sure you have it installed: pip install requests).
import json
import requests as r
ENDPOINT_URL = ""  # URL of your endpoint
HF_TOKEN = ""  # your Hugging Face access token
# payload samples
regular_payload = { "inputs": "translate English to German: The weather is nice today." }
parameter_payload = {
    "inputs": "translate English to German: Hello my name is Philipp and I am a Technical Leader at Hugging Face",
    "parameters": {
        "max_length": 40,
    },
}
# HTTP headers for authorization
headers = {
    "Authorization": f"Bearer {HF_TOKEN}",
    "Content-Type": "application/json",
}
# send request
response = r.post(ENDPOINT_URL, headers=headers, json=parameter_payload)
# fail early on HTTP errors (e.g. wrong token or URL)
response.raise_for_status()
generated_text = response.json()
print(generated_text)
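If you send more than one payload, the request can be wrapped in a small helper. The following is a sketch under stated assumptions rather than part of the original example: query_endpoint, its post parameter, and the 120-second default timeout are hypothetical names and values chosen for illustration. The post parameter only exists so the function can be tried with a stand-in instead of a live endpoint.

```python
from typing import Any, Callable, Dict, Optional


def query_endpoint(
    payload: Dict[str, Any],
    endpoint_url: str,
    token: str,
    post: Optional[Callable[..., Any]] = None,
    timeout: float = 120.0,
) -> Any:
    """Send one JSON payload to an endpoint and return the decoded response."""
    if post is None:
        # Default to requests.post; imported lazily so a stand-in can be
        # passed instead when no endpoint is available.
        import requests

        post = requests.post

    headers = {
        "Authorization": f"Bearer {token}",
        "Content-Type": "application/json",
    }
    response = post(endpoint_url, headers=headers, json=payload, timeout=timeout)
    # Raise on 4xx/5xx instead of trying to parse an error body as a result.
    response.raise_for_status()
    return response.json()
```

Usage mirrors the script above, for example query_endpoint(regular_payload, ENDPOINT_URL, HF_TOKEN).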