Wrong number of tensors; expected 292, got 291
ValueError: Ollama call failed with status code 500. Details: {"error":"llama runner process has terminated: error loading model: done_getting_tensors: wrong number of tensors; expected 292, got 291"}
Are these models broken again? This is my third attempt since the Hugging Face repo was updated. I've never seen this error before.
Just use llama.cpp. It has been updated with the RoPE scaling patch.
@qnixsynapse, I did use llama.cpp to convert the 32-bit safetensors to BF16, then quantized to Q8_0, Q6_K, and Q5_K_M. Then I imported them into Ollama from the CLI using a Modelfile. This is the error message Ollama kicks out. Something is seriously broken that Meta needs to sort out.
Why are you using Ollama in the first place? Use llama.cpp. The latest version has the RoPE scaling patch.
And it isn't Meta's fault if Ollama, which is a wrapper around llama.cpp, doesn't update its llama.cpp.
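For anyone wondering where 291 and 292 could come from, here is a rough sketch of the arithmetic. It is not confirmed anywhere in this thread. It assumes the usual llama.cpp GGUF tensor naming for a 32-layer Llama model, and it assumes that converters carrying the RoPE scaling patch write one extra `rope_freqs` tensor. The list names below are purely illustrative.

```python
# Assumed GGUF tensor layout for an 8B Llama model (names follow llama.cpp
# conventions; treat the whole layout as an assumption, not a specification).
N_LAYERS = 32  # transformer blocks in the 8B model

# Weight tensors assumed to exist once per transformer block.
PER_BLOCK = [
    "attn_norm", "attn_q", "attn_k", "attn_v", "attn_output",
    "ffn_norm", "ffn_gate", "ffn_up", "ffn_down",
]

# Tensors assumed to exist once per model.
GLOBAL = ["token_embd", "output_norm", "output"]

# Assumption: written only by converters that include the RoPE scaling patch.
ROPE_EXTRA = ["rope_freqs"]


def tensor_count(with_rope_freqs: bool) -> int:
    """Count tensors in the assumed layout, with or without the extra tensor."""
    count = N_LAYERS * len(PER_BLOCK) + len(GLOBAL)
    if with_rope_freqs:
        count += len(ROPE_EXTRA)
    return count


print(tensor_count(with_rope_freqs=False))  # 291
print(tensor_count(with_rope_freqs=True))   # 292
```

If that assumption holds, a loader without the patch would recognize only 291 of the 292 tensors in a freshly converted file, which would match the error message above.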
I had the same error message using Llama 3.1 from Unsloth. I was trying to run the example linked from the official Unsloth GitHub repo:
https://github.com/unslothai/unsloth -> https://colab.research.google.com/drive/1Ys44kVvmeZtnICzWz0xgpRnrIOjZAuxp?usp=sharing
and the code from the YouTuber Mervin: https://www.youtube.com/@MervinPraison -> https://mer.vin/2024/07/llama-3-1-fine-tune/
Unsloth completed the conversion, and neither script reported an error while creating the GGUF file.
I tried both Mervin's code and the official code to load the Unsloth GGUF into Ollama, and both failed with the same error:
Error: llama runner process has terminated: error loading model: done_getting_tensors: wrong number of tensors; expected 292, got 291
Since Unsloth fetches llama.cpp automatically when you call its functions, I had no idea which version it had pulled.
So I went into the llama.cpp directory (I am on Linux, so it was "cd llama.cpp"; search for the llama.cpp folder in your project, of course)
and then I executed: sudo git reset --hard 46e12c4692a37bdd31a0432fc5153d7d22bc7f72
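The reset above pins llama.cpp to one specific commit. Below is a minimal sketch for checking that a checkout really sits on that commit before converting again. It assumes `git` is on the PATH and that the clone lives in a folder named `llama.cpp`. The helpers `is_pinned` and `current_head` are hypothetical names, not part of Unsloth or llama.cpp.

```python
import subprocess

# Commit hash taken from the reset command in this post.
PINNED = "46e12c4692a37bdd31a0432fc5153d7d22bc7f72"


def is_pinned(head: str, pinned: str = PINNED) -> bool:
    """Return True if `head` (full or abbreviated hash) refers to the pinned commit."""
    head = head.strip().lower()
    # Require at least 7 characters so a very short prefix cannot match by accident.
    return len(head) >= 7 and pinned.startswith(head)


def current_head(repo_dir: str = "llama.cpp") -> str:
    """Ask git for the commit currently checked out in `repo_dir`."""
    result = subprocess.run(
        ["git", "-C", repo_dir, "rev-parse", "HEAD"],
        check=True,
        capture_output=True,
        text=True,
    )
    return result.stdout.strip()
```

Calling `is_pinned(current_head("llama.cpp"))` from inside the project should return `True` after the reset. If it returns `False`, the automation may have fetched a different version again.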
And yes, I asked ChatGPT to help me with this problem. I am very happy that it is working now, but development in this field does not look like it will be stable for the next few years. I hope it works on your system as well!
Best greetings
Matthias
Thank you, bud. Appreciated. I'll give it another go. Cheers
This should be fixed in the latest Ollama.