Instructions to use meta-llama/Llama-3.1-8B-Instruct with libraries, inference providers, notebooks, and local apps. Follow these links to get started.

Libraries

How to use meta-llama/Llama-3.1-8B-Instruct with Transformers:

# Use a pipeline as a high-level helper
from transformers import pipeline

pipe = pipeline("text-generation", model="meta-llama/Llama-3.1-8B-Instruct")
messages = [
    {"role": "user", "content": "Who are you?"},
]
pipe(messages)

# Load model directly
from transformers import AutoTokenizer, AutoModelForCausalLM

tokenizer = AutoTokenizer.from_pretrained("meta-llama/Llama-3.1-8B-Instruct")
model = AutoModelForCausalLM.from_pretrained("meta-llama/Llama-3.1-8B-Instruct")
messages = [
    {"role": "user", "content": "Who are you?"},
]
inputs = tokenizer.apply_chat_template(
    messages,
    add_generation_prompt=True,
    tokenize=True,
    return_dict=True,
    return_tensors="pt",
).to(model.device)

outputs = model.generate(**inputs, max_new_tokens=40)
print(tokenizer.decode(outputs[0][inputs["input_ids"].shape[-1]:]))


vLLM

How to use meta-llama/Llama-3.1-8B-Instruct with vLLM:

Install from pip and serve model

# Install vLLM from pip:
pip install vllm
# Start the vLLM server:
vllm serve "meta-llama/Llama-3.1-8B-Instruct"
# Call the server using curl (OpenAI-compatible API):
curl -X POST "http://localhost:8000/v1/chat/completions" \
	-H "Content-Type: application/json" \
	--data '{
		"model": "meta-llama/Llama-3.1-8B-Instruct",
		"messages": [
			{
				"role": "user",
				"content": "What is the capital of France?"
			}
		]
	}'
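
The same request can be sent from Python. This is a minimal sketch using only the standard library. It assumes the vLLM server started above is listening on localhost:8000 and that the response follows the OpenAI chat-completions shape. The helper names `build_chat_request` and `chat` are hypothetical, chosen for illustration.

```python
import json
import urllib.request


def build_chat_request(prompt, base_url="http://localhost:8000",
                       model="meta-llama/Llama-3.1-8B-Instruct"):
    """Build (but do not send) a POST request for /v1/chat/completions."""
    payload = {
        "model": model,
        "messages": [{"role": "user", "content": prompt}],
    }
    return urllib.request.Request(
        f"{base_url}/v1/chat/completions",
        data=json.dumps(payload).encode("utf-8"),
        headers={"Content-Type": "application/json"},
        method="POST",
    )


def chat(prompt, **kwargs):
    """Send the request and return the assistant's reply text.

    Assumes an OpenAI-style response body: choices[0].message.content.
    """
    with urllib.request.urlopen(build_chat_request(prompt, **kwargs)) as resp:
        body = json.load(resp)
    return body["choices"][0]["message"]["content"]


# Usage (requires a running server):
#   print(chat("What is the capital of France?"))
```

For the SGLang server, pass `base_url="http://localhost:30000"` to match the port used in its launch command.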

Use Docker

docker model run hf.co/meta-llama/Llama-3.1-8B-Instruct

SGLang

How to use meta-llama/Llama-3.1-8B-Instruct with SGLang:

Install from pip and serve model

# Install SGLang from pip:
pip install sglang
# Start the SGLang server:
python3 -m sglang.launch_server \
    --model-path "meta-llama/Llama-3.1-8B-Instruct" \
    --host 0.0.0.0 \
    --port 30000
# Call the server using curl (OpenAI-compatible API):
curl -X POST "http://localhost:30000/v1/chat/completions" \
	-H "Content-Type: application/json" \
	--data '{
		"model": "meta-llama/Llama-3.1-8B-Instruct",
		"messages": [
			{
				"role": "user",
				"content": "What is the capital of France?"
			}
		]
	}'

Use Docker images

docker run --gpus all \
    --shm-size 32g \
    -p 30000:30000 \
    -v ~/.cache/huggingface:/root/.cache/huggingface \
    --env "HF_TOKEN=<secret>" \
    --ipc=host \
    lmsysorg/sglang:latest \
    python3 -m sglang.launch_server \
        --model-path "meta-llama/Llama-3.1-8B-Instruct" \
        --host 0.0.0.0 \
        --port 30000
# Call the server using curl (OpenAI-compatible API):
curl -X POST "http://localhost:30000/v1/chat/completions" \
	-H "Content-Type: application/json" \
	--data '{
		"model": "meta-llama/Llama-3.1-8B-Instruct",
		"messages": [
			{
				"role": "user",
				"content": "What is the capital of France?"
			}
		]
	}'


Wrong number of tensors; expected 292, got 291

#69

by KingBadger - opened Jul 30, 2024

Discussion

KingBadger

Jul 30, 2024

ValueError: Ollama call failed with status code 500. Details: {"error":"llama runner process has terminated: error loading model: done_getting_tensors: wrong number of tensors; expected 292, got 291"}

Are these models broken again, even on my third attempt after the Hugging Face repo was updated? I've never seen this error before.

qnixsynapse

Jul 31, 2024

Just use llama.cpp. It has been updated with the RoPE scaling patch.

KingBadger

Jul 31, 2024

@qnixsynapse, I did use llama.cpp to convert the 32-bit safetensors to BF16, then quantized to Q8_0, Q6_K, and Q5_K_M. Then I imported the result into Ollama from the CLI using a Modelfile. This is the message Ollama gives when it errors out. There is something seriously broken that Meta needs to sort out.

qnixsynapse

Jul 31, 2024 (edited Jul 31, 2024)

Why are you using Ollama in the first place? Use llama.cpp. The latest version has the RoPE scaling patch.

And it isn't Meta's fault if Ollama, being a wrapper, doesn't update its llama.cpp.

mbergner

Aug 1, 2024

I got the same error message using Llama 3.1 from Unsloth. I was trying to run the example notebook linked from the official Unsloth GitHub repo:
https://github.com/unslothai/unsloth -> https://colab.research.google.com/drive/1Ys44kVvmeZtnICzWz0xgpRnrIOjZAuxp?usp=sharing
and the code from the YouTuber Mervin: https://www.youtube.com/@MervinPraison -> https://mer.vin/2024/07/llama-3-1-fine-tune/

In both cases Unsloth completed the conversion, and neither script raised an error while creating the GGUF file.

I then tried both Mervin's code and the official code to load the Unsloth GGUF into Ollama, and both failed with the same error:
Error: llama runner process has terminated: error loading model: done_getting_tensors: wrong number of tensors; expected 292, got 291

Since Unsloth automates fetching llama.cpp and calls its functions for you, I had no idea which version it had loaded.
So I went into the llama.cpp directory (I'm on Linux, so it was "cd llama.cpp"; look for the llama.cpp folder in your own project, of course)
and then I executed: sudo git reset --hard 46e12c4692a37bdd31a0432fc5153d7d22bc7f72

And yes, I asked ChatGPT to help me with that problem. I am very happy that it is working now, but developing in this field doesn't look like it will be stable for the next few years. I hope it works on your system as well!
Best greetings
Matthias

KingBadger

Aug 2, 2024

Thank you, bud, appreciated. I'll give it another go. Cheers.

melmass

Aug 20, 2024

This should be fixed in the latest Ollama.
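
For anyone who wants to see how many tensors a GGUF file declares before loading it, here is a minimal sketch that reads only the fixed-size file header. It assumes the GGUF v2/v3 layout (the 4-byte magic `GGUF`, a uint32 version, then a uint64 tensor count and a uint64 metadata key/value count, all little-endian). The function name `read_gguf_tensor_count` is hypothetical. It reports only the declared count, so it cannot tell you which tensor an older runtime fails to recognize.

```python
import struct


def read_gguf_tensor_count(path):
    """Return (version, tensor_count) from a GGUF file header.

    Assumes the GGUF v2/v3 header layout, little-endian:
    magic (4 bytes), version (uint32), tensor_count (uint64),
    metadata_kv_count (uint64).
    """
    with open(path, "rb") as f:
        header = f.read(24)
    if len(header) < 24 or header[:4] != b"GGUF":
        raise ValueError(f"{path} does not look like a GGUF file")
    version, tensor_count, _kv_count = struct.unpack("<IQQ", header[4:24])
    return version, tensor_count


# Usage (the path is a placeholder for your own converted file):
#   version, n_tensors = read_gguf_tensor_count("model.gguf")
#   print(f"GGUF v{version}, declares {n_tensors} tensors")
```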
