CAT-Paws 🐱

When you'd even welcome helping paws

Cat sleeping on top of a laptop.

CAT-Paws is an agentic LLM that thinks in Japanese (i.e., its reasoning trace is in Japanese). The model is based on Qwen3-Swallow-v0.2, a model continually pretrained from Qwen3 to read and write fluently in Japanese.

CAT-Paws is trained for multi-turn interactions involving coding, terminal interaction, and tool use. For non-agentic single-turn tasks, we recommend using its sibling model CAT-Thinking-8B.

Usage

Chat

To run the model with transformers, install the library and then run the following code.

pip install transformers

from transformers import pipeline

# Load the model
chat_pipeline = pipeline("text-generation", model="CyberAgent/CAT-Paws-8B")

prompt = "You have two cats, one male and one female. A female cat gives birth to up to 12 kittens per year.\n" + \
  "Assume you don't spay them. In three years, how many cats might you need to take care of at most?"

user_input = [{"role": "user", "content": prompt}]

response = chat_pipeline(user_input, max_new_tokens=8192, temperature=0.8, top_p=0.95)

print(response[0]['generated_text'])

CAT-Paws is designed to reason in Japanese even when the input text is in English. The model is trained with a maximum output length of 4096 tokens, so we recommend setting max_new_tokens to at least 4096, and higher for difficult problems. Although the model is trained to respond within 4096 tokens, it tends to generate longer responses for difficult or confusing instructions. It also often gets stuck in repetition when the instruction is confusing (e.g., when two contradictory instructions are given).

Harness

We recommend using a harness with a minimal system prompt for CAT-Paws. Because the context length is quite limited (40k), it does not work well with the long system prompts used in harnesses for frontier models. Instead, it works well with simple, minimal agent harnesses such as mini-swe-agent and terminus-2. We also recommend harnesses that compress the message history to keep the context short.
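As an illustration of keeping the context short, the sketch below drops the oldest non-system turns until the history fits a budget. This is only the simplest form of compression, not what mini-swe-agent or terminus-2 actually do. The names `compress_history` and `estimate_tokens` are hypothetical, and the one-token-per-character estimate is a deliberately crude assumption; a real harness would count tokens with the model's tokenizer.

```python
from typing import Callable, Dict, List

Message = Dict[str, str]


def estimate_tokens(text: str) -> int:
    """Crude stand-in for a tokenizer (assumption): one token per character."""
    return len(text)


def compress_history(
    messages: List[Message],
    budget: int,
    count_tokens: Callable[[str], int] = estimate_tokens,
) -> List[Message]:
    """Keep all system messages plus the most recent turns that fit the budget."""
    system = [m for m in messages if m["role"] == "system"]
    rest = [m for m in messages if m["role"] != "system"]

    used = sum(count_tokens(m["content"]) for m in system)
    kept: List[Message] = []
    # Walk backwards from the newest turn and stop at the first one that overflows.
    for message in reversed(rest):
        cost = count_tokens(message["content"])
        if used + cost > budget:
            break
        kept.append(message)
        used += cost
    kept.reverse()
    return system + kept


history = [
    {"role": "system", "content": "sys"},
    {"role": "user", "content": "a" * 10},
    {"role": "assistant", "content": "b" * 10},
    {"role": "user", "content": "c" * 10},
]
short_history = compress_history(history, budget=25)
```

With a budget of 25, the oldest user turn is dropped and the system message and the two most recent turns are kept. Summarizing old turns instead of dropping them is another option, at the cost of an extra model call.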

Tool Calling

CAT-Paws is NOT trained to receive tools through a special interface. Instead, it accepts tools via the system prompt and/or user messages. To let CAT-Paws use tools, describe the tool usage and the call format in the system message. A JSON object is the recommended tool call format. For example:

[{"name": <function-name>, "arguments": <args-json-object>}]
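A minimal sketch of passing tools through the system message might look like the following. The helper `build_tool_system_prompt`, the prompt wording, and the `get_weather` tool are all hypothetical; the model card does not prescribe a specific prompt, only that tools and the call format are described in plain text.

```python
import json
from typing import Any, Dict, List


def build_tool_system_prompt(tools: List[Dict[str, Any]]) -> str:
    """Describe the tools and the expected call format in plain text."""
    lines = [
        "You can call the tools listed below.",
        "To call a tool, reply with a JSON list in this format:",
        '[{"name": <function-name>, "arguments": <args-json-object>}]',
        "",
        "Available tools:",
    ]
    for tool in tools:
        # ensure_ascii=False keeps Japanese descriptions readable.
        lines.append(json.dumps(tool, ensure_ascii=False))
    return "\n".join(lines)


# Hypothetical tool definition, for illustration only.
tools = [
    {
        "name": "get_weather",
        "description": "Return the current weather for a city.",
        "parameters": {"city": "string"},
    }
]

messages = [
    {"role": "system", "content": build_tool_system_prompt(tools)},
    {"role": "user", "content": "What is the weather in Tokyo?"},
]
```

The resulting `messages` list can be passed to the chat pipeline in the same way as the single user message in the Chat section.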

CAT-Paws generates a reasoning trace with high probability even when instructed not to. If the tool call requires the entire message to be in a specific format, we recommend preprocessing the message to remove the reasoning trace (<think>...</think>).
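The preprocessing step can be sketched as follows, assuming the tool call is the JSON list format shown earlier in this section. The function names `strip_reasoning` and `parse_tool_calls` are hypothetical. If the trace is not closed (for example because generation was truncated), nothing is stripped and the parser returns an empty list.

```python
import json
import re
from typing import Any, Dict, List

# Non-greedy match so that each <think>...</think> span is removed separately.
THINK_RE = re.compile(r"<think>.*?</think>", re.DOTALL)


def strip_reasoning(text: str) -> str:
    """Remove every closed <think>...</think> span from the message."""
    return THINK_RE.sub("", text).strip()


def parse_tool_calls(text: str) -> List[Dict[str, Any]]:
    """Return well-formed tool calls, or an empty list if parsing fails."""
    cleaned = strip_reasoning(text)
    try:
        calls = json.loads(cleaned)
    except json.JSONDecodeError:
        return []
    if isinstance(calls, dict):
        calls = [calls]
    if not isinstance(calls, list):
        return []
    return [
        call
        for call in calls
        if isinstance(call, dict) and "name" in call and "arguments" in call
    ]


raw = (
    "<think>天気を調べる必要がある。</think>\n"
    '[{"name": "get_weather", "arguments": {"city": "Tokyo"}}]'
)
calls = parse_tool_calls(raw)
```

Returning an empty list on malformed output lets the harness treat the message as ordinary text or ask the model to retry.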

Evaluation

Agentic Capability

We compare the performance of CAT-Paws-8B with Qwen-3-8B using j-tau-bench. We evaluate on the telecom domain in Japanese and English using GLM-4.7-AWQ as the user simulator. Three trials are run. Overall, we observe CAT-Paws to be on par with Qwen-3 in English and marginally above it in Japanese. As a reference, we also run experiments using Qwen3.6-27B-FP8 as the user simulator for easier reproducibility. CAT-Paws achieves a higher score than Qwen-3 in this setting too.

| Benchmark | User LLM | CAT-Paws | Qwen-3-8B |
|---|---|---|---|
| telecom_ja | GLM-4.7 | 19.6 | 16.8 |
| telecom | GLM-4.7 | 20.7 | 20.4 |
| telecom_ja | Qwen3.6-27B | 12.9 | 7.8 |

Additionally, we evaluate CAT-Paws with the terminus-2 harness on harness-bench-fast and humaneval-fix. Scores marked with * are taken from the respective papers rather than from our experiments. CAT-Paws is far less accurate than frontier models, but it can serve as a lightweight solution for simple tasks. A more detailed evaluation will be presented in a technical report.

| Model | harness-bench |
|---|---|
| Claude Opus 4.8 (Claude Code CLI) | 100* |
| GPT-OSS-120B (deepagents) | 49.5* |
| CAT-Paws (terminus-2) | 27.3 |

| Model | humaneval-fix (Python) |
|---|---|
| GPT-4 | 47.0* |
| CAT-Paws (terminus-2) | 45.7 |

Coding and Math

Evaluation of CAT-Paws in coding and math tasks.

We evaluated single-turn coding and math tasks in Japanese and English, comparing against Qwen-3-8B (Qwen-3), Qwen3-Swallow-8B-RL-v0.2 (Swallow), and CAT-Thinking-8B. Random sampling (temperature=0.8, top_p=0.95, max_new_tokens=4096) is used for all runs. Overall, CAT-Paws scores lower than the other models. We observe that it often fails by calling tools that do not exist, asking the user for further clarification, or trying to solve the task over multiple turns. For single-turn tasks, we recommend using the other models.

Training Procedure

The training procedure mostly follows that of CAT-Thinking-8B, with some modifications for agentic capability.

We generate a teacher dataset using gpt-oss-120b as a reference. The dataset consists of math, coding (Python, shell script), tool calling, and generic instruction following tasks. Since the reasoning traces are in English, we translate them into Japanese using CAT-Translate-7b and gpt-oss-20b. We train the Swallow model using the synthesized dataset with full-parameter SFT.

Then, we run GRPO with a permissive reward model that gives partial rewards for (1) following the reasoning format, (2) generating the reasoning trace and the main text in Japanese, and (3) answering the question in the instructed format. In this way, the model learns to follow the reasoning format and to generate its reasoning trace in Japanese. Since this training phase focuses on learning the superficial format rather than reasoning competence itself, we use LoRA.
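To make the idea of partial rewards concrete, here is a rough sketch of a rule-based scorer for the three criteria. It is not the reward model used in training: the equal weighting, the 0.5 Japanese-character threshold, the Unicode ranges, and the `\boxed{...}` answer pattern are all assumptions chosen for illustration, and `permissive_reward` is a hypothetical name.

```python
import re

# Reasoning trace first, then the main text (assumed layout).
THINK_RE = re.compile(r"^<think>(.*?)</think>(.*)$", re.DOTALL)
# Hiragana, katakana, and common kanji (assumed definition of "Japanese").
JAPANESE_RE = re.compile(r"[\u3040-\u30ff\u4e00-\u9fff]")


def japanese_ratio(text: str) -> float:
    """Fraction of non-whitespace characters that are kana or kanji."""
    chars = [c for c in text if not c.isspace()]
    if not chars:
        return 0.0
    return sum(1 for c in chars if JAPANESE_RE.match(c)) / len(chars)


def permissive_reward(
    completion: str,
    answer_pattern: str = r"\\boxed\{.+?\}",  # assumed instructed answer format
    min_ratio: float = 0.5,  # assumed threshold
) -> float:
    """Give one third of the reward for each satisfied criterion."""
    passed = 0
    match = THINK_RE.match(completion.strip())
    if match:
        passed += 1  # (1) follows the reasoning format
        trace, body = match.groups()
    else:
        trace, body = "", completion
    if japanese_ratio(trace) >= min_ratio and japanese_ratio(body) >= min_ratio:
        passed += 1  # (2) trace and main text are in Japanese
    if re.search(answer_pattern, body):
        passed += 1  # (3) answer is in the instructed format
    return passed / 3
```

Because each criterion contributes independently, a completion that gets only the format right still receives a non-zero learning signal, which is the point of a permissive reward.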

Finally, we train the model with GRPO using a strict reward model that gives a reward only if the model follows all format constraints and also generates the correct answer. During the GRPO steps, we include multi-turn tasks in the coding, shell script, and tool calling domains.
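The all-or-nothing reward and the way GRPO turns a group of rewards into advantages can be sketched as below. The function names, the example checks, and the use of the population standard deviation are assumptions for illustration; GRPO implementations differ in such details, and this is not the training code used for CAT-Paws.

```python
from statistics import mean, pstdev
from typing import Callable, List, Sequence


def strict_reward(
    completion: str,
    format_checks: Sequence[Callable[[str], bool]],
    is_correct: Callable[[str], bool],
) -> float:
    """Return 1.0 only if every format check passes and the answer is correct."""
    if all(check(completion) for check in format_checks) and is_correct(completion):
        return 1.0
    return 0.0


def group_advantages(rewards: List[float], eps: float = 1e-6) -> List[float]:
    """Normalize rewards within one group of samples for the same prompt."""
    mu = mean(rewards)
    sigma = pstdev(rewards)
    return [(r - mu) / (sigma + eps) for r in rewards]


# Hypothetical checks: a closed reasoning trace and a final answer of "42".
checks = [lambda c: c.startswith("<think>") and "</think>" in c]
correct = lambda c: c.rstrip().endswith("42")

group = [
    "<think>計算する。</think>答えは 42",
    "<think>計算する。</think>答えは 41",
    "答えは 42",
    "<think>計算する。",
]
rewards = [strict_reward(c, checks, correct) for c in group]
advantages = group_advantages(rewards)
```

Only the first completion satisfies every constraint, so it alone gets a positive advantage. When all samples in a group receive the same reward, every advantage is zero and the group contributes no gradient.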

License

The model is licensed under the Apache 2.0 License.

Citation

TBA
