
I want this model to be great, but it is broken at the moment

#10
by nlisac - opened

I have been wrestling with the chat template for a few days now - spending hours on it, in fact. Using what others have done solves my issue of getting 'Error predicting: Error: Error rendering prompt with jinja template: "Cannot apply filter "length" to type: UndefinedValue".' when using the model in LM Studio. I am using this quant, but have tried others: https://huggingface.co/bartowski/ServiceNow-AI_Apriel-1.6-15b-Thinker-GGUF
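For anyone hitting the same error: the class of failure is easy to reproduce outside LM Studio. Here is a minimal sketch with Python's jinja2 - not the actual Apriel template; `tools` just stands in for any variable the runtime never passes to the template:

```python
# Minimal repro of the error class, using plain jinja2 (pip install jinja2).
# NOT the actual Apriel template; `tools` is an illustrative stand-in.
from jinja2 import Environment

env = Environment()

broken = env.from_string("{{ tools | length }} tool(s) available")
try:
    print(broken.render(messages=[{"role": "user", "content": "hi"}]))
except Exception as e:
    # jinja2 raises UndefinedError here - the Python-side analogue of LM
    # Studio's 'Cannot apply filter "length" to type: UndefinedValue'.
    print("broken:", e)

# The usual patch: guard the variable before filtering it.
fixed = env.from_string(
    "{% if tools is defined and tools %}{{ tools | length }} tool(s)"
    "{% else %}no tools{% endif %}"
)
print("fixed:", fixed.render(messages=[]))
```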

After I solve the chat template issue, I end up unable to get a response from the model after the first exchange. I will say "Tell me a fun fact about cheetahs." and the model will respond. Then I say "More!" and it thinks, but outputs no response. This is where I got stuck, so when I saw there was an online chat I could use, I thought I would give it a try.

Using https://huggingface.co/spaces/ServiceNow-AI/Apriel-Chat, I tried my same cheetah prompt, and when I said "More!" it spun off into a wild thinking chain; it has now been stuck looping in that chain for several minutes.

I think there is something fundamentally wrong with the model (or the chat template, at least). I don't know what it is but I would love to have this fixed, as these models are very exciting for me.
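For anyone debugging the same second-turn failure, a quick way to see what the model is actually being fed, outside LM Studio, is to render a two-turn conversation through the repo's chat template. The repo id here is my guess, inferred from the GGUF links in this thread:

```python
# Render a two-turn chat to inspect the raw prompt the template produces.
# Repo id is an assumption inferred from the GGUF links in this thread.
from transformers import AutoTokenizer

tok = AutoTokenizer.from_pretrained("ServiceNow-AI/Apriel-1.6-15b-Thinker")
msgs = [
    {"role": "user", "content": "Tell me a fun fact about cheetahs."},
    {"role": "assistant", "content": "Cheetahs can reach roughly 100 km/h in short sprints."},
    {"role": "user", "content": "More!"},
]
# tokenize=False returns the rendered string, so you can check whether the
# second user turn and the generation prompt come out malformed.
print(tok.apply_chat_template(msgs, tokenize=False, add_generation_prompt=True))
```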

I went through these struggles with the 1.5 version as well, though that ended up being solvable for me with a new chat template. I wish there were more testing/validation from this team on these models in the wild, and ideally something more standardized. The thinking parameters, for example, are so specific and non-standard that anyone spinning up this model cannot use it without a custom thinking token start/stop definition (sketched below). That is not a large hurdle on its own, but it alone would stop an average user who has other options. Combined with the Jinja templates that do not work, it makes these models sadly unapproachable for many, when they obviously have promise and a lot of well-intentioned work put into them.
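To make "custom thinking token start/stop definition" concrete, this is the kind of glue a user currently has to write. The marker strings are assumptions taken from the transcripts later in this thread, not an official spec:

```python
# Strip reasoning blocks from raw output, given assumed start/stop markers.
# Marker strings are taken from the transcripts in this thread - adjust them
# to whatever the final chat template actually emits.
REASONING_START = "Here are my reasoning steps:"
REASONING_END = "<|im_end|>"

def extract_answer(raw: str) -> str:
    """Drop every reasoning block and return only the visible answer."""
    while REASONING_START in raw:
        head, _, tail = raw.partition(REASONING_START)
        # Everything up to the next end marker is reasoning; keep the rest.
        # (If the end marker is missing, the whole tail is treated as reasoning.)
        _, _, tail = tail.partition(REASONING_END)
        raw = head + tail
    return raw.strip()

print(extract_answer(
    "Here are my reasoning steps:\nThe user wants a fun fact.\n<|im_end|>\n"
    "Cheetahs cannot roar - they chirp."
))  # -> Cheetahs cannot roar - they chirp.
```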

Also, does it really start with "Here are my reasoning steps:\n" as mentioned in the model card? It doesn't seem so. I just tried the latest updated GGUF.

They released a GGUF model and it works. I wanted a higher quant, so I stole the chat template and slapped it into a Q8 quant and it looks like everything is now working flawlessly! https://huggingface.co/ServiceNow-AI/Apriel-1.6-15b-Thinker-GGUF
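For anyone wanting to do the same swap, here is a sketch of reading the template out of the official GGUF with the gguf Python package (pip install gguf); the file name is a placeholder. Writing it into another quant can then be done with llama.cpp's gguf-py scripts (gguf_new_metadata.py has a --chat-template option, if I remember the flag correctly):

```python
# Read tokenizer.chat_template out of a GGUF so it can be copied elsewhere.
# Uses the same string-field idiom as gguf-py's own gguf_dump.py.
from gguf import GGUFReader

reader = GGUFReader("Apriel-1.6-15b-Thinker-Q8_0.gguf")  # placeholder path
field = reader.fields["tokenizer.chat_template"]
template = bytes(field.parts[-1]).decode("utf-8")  # last part = string bytes
print(template)
```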

With your gguf's template, I'm still getting the loop behavior:

prompt:

[image attached]

What you see here?

answer:

Here are my reasoning steps:
The user asks: "What you see here?" They show an image: a movie poster for "The Taste of Memory". The question: "What you see here?" Probably they want a description of the image. So we need to respond with a description. The user likely expects a short answer: "A movie poster". Or "a scene with a man and a woman in a kitchen". Let's parse the image.

The image shows a poster for "The Taste of Memory". It features a man and a woman in a rustic kitchen. The man is wearing an apron, cutting something on a wooden table. The woman is looking at him. There are wooden crates with grapes, a painting on the wall, a lantern, etc. The top includes logos: "Eleanor Vance Academy Award Winner", "Sundance Film Festival 2024", "Arthur Penhaldon Golden Globe Award Winner". The bottom includes "A film by Tongyi Interaction Lab". So the answer: "A movie poster for The Taste of Memory". Or "A still from a film". The question: "What you see here?" Possibly they want "a poster". Or "a man and a woman cooking". Or "a kitchen scene". The user might be testing the model's ability to interpret images. The answer should be short.

Given the context, the simplest answer: "A movie poster". Or "A scene from a film". Let's see the phrasing: "What you see here?" Usually that question expects a description like "a man and a woman in a kitchen". But the user might want a simple answer: "a poster". The image includes a lot of text, so it's definitely a poster.

Thus answer: "A movie poster for 'The Taste of Memory'". Or "A film poster". Let's respond with a concise description.

We can also mention the two actors: "Eleanor Vance and Arthur Penhaldon". But the question is generic: "What you see here?" So answer: "A movie poster". Let's answer that.

We need to follow the guidelines: no disallowed content. It's safe. We'll answer.

We can also mention the scene: "A man and a woman in a kitchen". But the question is ambiguous. Let's answer: "A movie poster". That is likely correct.

Thus final answer: "A movie poster for the film 'The Taste of Memory'".
<|im_end|>
Here are my reasoning steps:
The user asks: "What you see here?" They show an image. The answer should be a description. The simplest answer: "A movie poster". Or "A still from a film". The question is ambiguous but likely expects "a movie poster". I'll answer: "A movie poster for 'The Taste of Memory'".
<|im_end|>
Here are my reasoning steps:
We need to respond with a short answer. The user likely expects a description: "A movie poster". So answer: "A movie poster".
<|im_end|>
Here are my reasoning steps:
The user asks "What you see here?" The image is a movie poster. So answer: "A movie poster".
<|im_end|>
Here are my reasoning steps:
The user asks "What you see here?" The image is a movie poster for "The Taste of Memory". So answer: "A movie poster".
<|im_end|>
...

But in the first place, and it seems to be the cause, I can't get reliable answer formatting: the answer sometimes ends with <|im_end|> and sometimes doesn't (1), and sometimes a "Here are my reasoning steps:" appears inside the answer (2). For the two examples below the generation stopped correctly, though. (A crude stop-string workaround is sketched after example (2).)

(1)

We'll answer in a single sentence: "It's a movie poster for 'The Taste of Memory'". That should be correct.

We'll not mention policy. We'll answer succinctly.
<|im_end|>
Here is a movie poster for **The Taste of Memory**, showing a man and a woman in a rustic kitchen.
\boxed{Poster}

(2)

Now, the user might want a short answer: "The Taste of Memory". Or "A film poster". I'll answer: "A movie poster for 'The Taste of Memory'". That is safe.

Thus final answer: "A poster for the film 'The Taste of Memory'".
<|im_end|>
Here are my reasoning steps:
The user asks "What you see here?" They provided an image. The answer should be a description. The image is a movie poster for "The Taste of Memory". So answer: "A movie poster". Or "The Taste of Memory". I'll answer succinctly: "A movie poster for the film 'The Taste of Memory'".
<|im_end|>
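The crude workaround mentioned above: force <|im_end|> as a stop string at the sampler level so generation can't roll into another reasoning block. A sketch with llama-cpp-python (the model path is a placeholder); note the caveat in the comments:

```python
# Band-aid, not a fix: stop at the first <|im_end|> so the model cannot start
# another "Here are my reasoning steps:" block. llama-cpp-python sketch; the
# model path is a placeholder. Caveat: if the template emits the visible
# answer AFTER the first <|im_end|> (as in example (1) above), this truncates
# it - which is exactly the inconsistency being reported.
from llama_cpp import Llama

llm = Llama(model_path="Apriel-1.6-15b-Thinker-Q8_0.gguf")  # placeholder
out = llm.create_chat_completion(
    messages=[{"role": "user", "content": "Tell me a fun fact about cheetahs."}],
    stop=["<|im_end|>"],  # halt at the first end marker instead of looping
    max_tokens=512,
)
print(out["choices"][0]["message"]["content"])
```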

+1 for the latest official gguf being stuck and looping
