Instructions to use cyberagent/CAT-Paws-8B with libraries and local apps.

Libraries

How to use cyberagent/CAT-Paws-8B with Transformers:

# Use a pipeline as a high-level helper
from transformers import pipeline

pipe = pipeline("text-generation", model="cyberagent/CAT-Paws-8B")
messages = [
    {"role": "user", "content": "Who are you?"},
]
pipe(messages)

# Load model directly
from transformers import AutoTokenizer, AutoModelForCausalLM

tokenizer = AutoTokenizer.from_pretrained("cyberagent/CAT-Paws-8B")
model = AutoModelForCausalLM.from_pretrained("cyberagent/CAT-Paws-8B")
messages = [
    {"role": "user", "content": "Who are you?"},
]
inputs = tokenizer.apply_chat_template(
	messages,
	add_generation_prompt=True,
	tokenize=True,
	return_dict=True,
	return_tensors="pt",
).to(model.device)

outputs = model.generate(**inputs, max_new_tokens=40)
print(tokenizer.decode(outputs[0][inputs["input_ids"].shape[-1]:]))

Local Apps

vLLM

How to use cyberagent/CAT-Paws-8B with vLLM:

Install from pip and serve model

# Install vLLM from pip:
pip install vllm
# Start the vLLM server:
vllm serve "cyberagent/CAT-Paws-8B"
# Call the server using curl (OpenAI-compatible API):
curl -X POST "http://localhost:8000/v1/chat/completions" \
	-H "Content-Type: application/json" \
	--data '{
		"model": "cyberagent/CAT-Paws-8B",
		"messages": [
			{
				"role": "user",
				"content": "What is the capital of France?"
			}
		]
	}'
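
The same request can also be sent from Python. The following is a minimal sketch using only the standard library, assuming the vLLM server started above is listening on localhost:8000. `build_chat_request` and `ask` are hypothetical helper names, not part of vLLM.

```python
import json
import urllib.request

MODEL_ID = "cyberagent/CAT-Paws-8B"


def build_chat_request(prompt, base_url="http://localhost:8000"):
    """Build a POST request for the OpenAI-compatible chat completions endpoint."""
    payload = {
        "model": MODEL_ID,
        "messages": [{"role": "user", "content": prompt}],
    }
    return urllib.request.Request(
        f"{base_url}/v1/chat/completions",
        data=json.dumps(payload).encode("utf-8"),
        headers={"Content-Type": "application/json"},
        method="POST",
    )


def ask(prompt, base_url="http://localhost:8000"):
    """Send the request and return the reply text (requires a running server)."""
    with urllib.request.urlopen(build_chat_request(prompt, base_url)) as resp:
        body = json.load(resp)
    return body["choices"][0]["message"]["content"]
```

For a server on a different port, such as the SGLang examples that use port 30000, pass `base_url="http://localhost:30000"`.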

Use Docker

docker model run hf.co/cyberagent/CAT-Paws-8B

SGLang

How to use cyberagent/CAT-Paws-8B with SGLang:

Install from pip and serve model

# Install SGLang from pip:
pip install sglang
# Start the SGLang server:
python3 -m sglang.launch_server \
    --model-path "cyberagent/CAT-Paws-8B" \
    --host 0.0.0.0 \
    --port 30000
# Call the server using curl (OpenAI-compatible API):
curl -X POST "http://localhost:30000/v1/chat/completions" \
	-H "Content-Type: application/json" \
	--data '{
		"model": "cyberagent/CAT-Paws-8B",
		"messages": [
			{
				"role": "user",
				"content": "What is the capital of France?"
			}
		]
	}'

Use Docker images

docker run --gpus all \
    --shm-size 32g \
    -p 30000:30000 \
    -v ~/.cache/huggingface:/root/.cache/huggingface \
    --env "HF_TOKEN=<secret>" \
    --ipc=host \
    lmsysorg/sglang:latest \
    python3 -m sglang.launch_server \
        --model-path "cyberagent/CAT-Paws-8B" \
        --host 0.0.0.0 \
        --port 30000
# Call the server using curl (OpenAI-compatible API):
curl -X POST "http://localhost:30000/v1/chat/completions" \
	-H "Content-Type: application/json" \
	--data '{
		"model": "cyberagent/CAT-Paws-8B",
		"messages": [
			{
				"role": "user",
				"content": "What is the capital of France?"
			}
		]
	}'


CAT-Paws 🐱

When you'd even welcome helping paws

CAT-Paws is an agentic LLM that thinks in Japanese (e.g., its reasoning trace is in Japanese). The model is based on Qwen3-Swallow-v0.2, a model continually pretrained from Qwen3 to read and write Japanese fluently.

CAT-Paws is trained for multi-turn interactions involving coding, terminal interaction, and tool use. For non-agentic single-turn tasks, we recommend its sibling model, CAT-Thinking-8B.

Usage

Chat

To run the model with transformers, install the library and run the following script.

pip install transformers

from transformers import pipeline

# Load the model
chat_pipeline = pipeline("text-generation", model="cyberagent/CAT-Paws-8B")

prompt = (
    "You have two cats, one male and one female. A female cat gives birth to up to 12 kittens per year.\n"
    "Assume you don't spay them. In three years, how many cats might you need to take care of at most?"
)

user_input = [{"role": "user", "content": prompt}]

# Sample explicitly rather than decoding greedily
response = chat_pipeline(user_input, max_new_tokens=8192, do_sample=True, temperature=0.8, top_p=0.95)

# For chat input, generated_text is the full message list; the last entry is the model's reply
print(response[0]["generated_text"][-1]["content"])

CAT-Paws is designed to reason in Japanese even when the input text is in English. The model is trained with a maximum output length of 4096 tokens, but it tends to generate longer responses, especially for difficult or confusing instructions. We therefore recommend setting max_new_tokens to at least 4096, and higher for difficult problems. The model often gets stuck in repetition, especially when the instruction is confusing (e.g., when two contradictory instructions are given).
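
One way to notice the repetition failure mode is to check whether the end of a response is the same chunk of text repeated several times. The following is a rough heuristic sketch, not part of the model or of any library. The function name and the `min_period`, `max_period`, and `min_repeats` thresholds are assumptions chosen for illustration.

```python
def find_repeating_suffix(text, min_period=10, max_period=200, min_repeats=3):
    """Return the unit repeated back to back at the end of text, or None.

    A unit of length p counts as a loop if the last p * min_repeats
    characters consist of that unit repeated min_repeats times.
    """
    for period in range(min_period, max_period + 1):
        span = period * min_repeats
        if len(text) < span:
            break  # text is too short for this period or any longer one
        unit = text[-period:]
        if text.endswith(unit * min_repeats):
            return unit
    return None
```

A caller could stop generation early or retry with a rephrased instruction when this returns a non-None value.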

Harness

We recommend using a harness with a minimal system prompt for CAT-Paws. Because the context length is quite limited (40k tokens), the model doesn't work well with the long system prompts used in harnesses for frontier models. It works well with simple, minimal agent harnesses such as mini-swe-agent and terminus-2. We also recommend harnesses that compress the message history to keep the context short.
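
As an illustration of history compression, the sketch below keeps the system messages and only as many of the most recent messages as fit in a token budget. It is one simple strategy under stated assumptions, not how mini-swe-agent or terminus-2 implement it. `compress_history` is a hypothetical helper, and `count_words` is a crude stand-in for a real token count from the model's tokenizer.

```python
def count_words(text):
    """Crude token estimate (assumption); use the tokenizer in practice."""
    return len(text.split())


def compress_history(messages, max_tokens, count_tokens=count_words):
    """Keep system messages plus the newest messages that fit in max_tokens."""
    system = [m for m in messages if m["role"] == "system"]
    rest = [m for m in messages if m["role"] != "system"]
    budget = max_tokens - sum(count_tokens(m["content"]) for m in system)

    kept = []
    for message in reversed(rest):  # walk from newest to oldest
        cost = count_tokens(message["content"])
        if cost > budget:
            break  # drop this message and everything older
        kept.append(message)
        budget -= cost
    return system + kept[::-1]  # restore chronological order
```

A real harness might summarize the dropped messages instead of discarding them.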

Tool Calling

CAT-Paws is NOT trained to receive tools through a special interface. Instead, it accepts tools via the system prompt and/or user messages. To let CAT-Paws use tools, describe the tool usage and the call format in the system message. We recommend a JSON object as the tool call format. For example:

[{"name": <function-name>, "arguments": <args-json-object>}]
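
A minimal sketch of passing tools through the system message might look like the following. The `run_shell` tool, its description, and the prompt wording are hypothetical and serve only to show the shape of the messages. The card does not prescribe a particular prompt.

```python
import json

# Hypothetical tool definition, for illustration only.
TOOLS = [
    {
        "name": "run_shell",
        "description": "Run a shell command and return its output.",
        "arguments": {"command": "string"},
    }
]


def build_system_prompt(tools):
    """Describe the available tools and the expected call format in plain text."""
    lines = [
        "You can call the following tools:",
        json.dumps(tools, ensure_ascii=False, indent=2),
        "To call tools, reply with a JSON list in this format:",
        '[{"name": <function-name>, "arguments": <args-json-object>}]',
    ]
    return "\n".join(lines)


messages = [
    {"role": "system", "content": build_system_prompt(TOOLS)},
    {"role": "user", "content": "List the files in the current directory."},
]
```

The resulting `messages` list can be passed to the pipeline or to a chat completions endpoint like any other conversation.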

CAT-Paws generates a reasoning trace with high probability even when instructed not to. If the tool call requires the entire message to be in a specific format, we recommend preprocessing the message to remove the reasoning trace (<think>...</think>).
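
The preprocessing step can be sketched as follows, assuming the reasoning trace is wrapped in a single `<think>...</think>` block and the rest of the message is a JSON list of tool calls. `strip_reasoning` and `parse_tool_calls` are hypothetical helper names.

```python
import json
import re

# Non-greedy match across newlines so each <think>...</think> block is removed.
THINK_BLOCK = re.compile(r"<think>.*?</think>", re.DOTALL)


def strip_reasoning(text):
    """Remove closed <think>...</think> blocks and surrounding whitespace."""
    return THINK_BLOCK.sub("", text).strip()


def parse_tool_calls(text):
    """Return the list of tool calls in the message, or None if it is not a JSON list."""
    try:
        calls = json.loads(strip_reasoning(text))
    except json.JSONDecodeError:
        return None
    return calls if isinstance(calls, list) else None
```

An unclosed `<think>` block, for example from a response cut off at max_new_tokens, is left untouched by this sketch, so `parse_tool_calls` returns None in that case.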

Evaluation

Agentic Capability

We compare the performance of CAT-Paws-8B with Qwen-3-8B on j-tau-bench. We evaluate on the telecom domain in Japanese and English, using GLM-4.7-AWQ as the user simulator and running three trials. Overall, CAT-Paws is on par with Qwen-3 in English (20.7 vs. 20.4) and marginally above it in Japanese (19.6 vs. 16.8). For easier reproducibility, we also run the Japanese benchmark with Qwen3.6-27B-FP8 as the user simulator. CAT-Paws achieves a higher score than Qwen-3 in this setting too.

Benchmark	User LLM	CAT-Paws	Qwen-3-8B
telecom_ja	GLM-4.7	19.6	16.8
telecom	GLM-4.7	20.7	20.4
telecom_ja	Qwen3.6-27B	12.9	7.8
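
To make the gaps in the table explicit, the short snippet below computes the difference between CAT-Paws and Qwen-3-8B for each row. The numbers are copied from the table. The variable names are illustrative.

```python
# (benchmark, user LLM) -> (CAT-Paws score, Qwen-3-8B score), copied from the table
scores = {
    ("telecom_ja", "GLM-4.7"): (19.6, 16.8),
    ("telecom", "GLM-4.7"): (20.7, 20.4),
    ("telecom_ja", "Qwen3.6-27B"): (12.9, 7.8),
}

# Round to one decimal place to hide floating-point noise.
gaps = {key: round(cat_paws - qwen3, 1) for key, (cat_paws, qwen3) in scores.items()}

for (benchmark, user_llm), gap in gaps.items():
    print(f"{benchmark} ({user_llm}): {gap:+.1f}")
```

The differences are +2.8, +0.3, and +5.1 points respectively.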

Additionally, we evaluate CAT-Paws with the terminus-2 harness on harness-bench-fast and humaneval-fix. The scores of the other models (marked with *) are taken from the respective papers, not from our experiments. The accuracy of CAT-Paws is far from that of frontier models, but it can serve as a lightweight solution for simple tasks. A more detailed evaluation will be presented in a technical report.

Model	harness-bench
Claude Opus 4.8 (Claude Code CLI)	100*
GPT-OSS-120B (deepagents)	49.5*
CAT-Paws (terminus-2)	27.3

Model	humaneval-fix (Python)
GPT-4	47.0*
CAT-Paws (terminus-2)	45.7

Coding and Math

We evaluate single-turn coding and math tasks in Japanese and English, comparing against Qwen-3-8B (Qwen-3), Qwen3-Swallow-8B-RL-v0.2 (Swallow), and CAT-Thinking-8B. Random sampling (temperature=0.8, top_p=0.95, max_new_tokens=4096) is used for all runs. Overall, CAT-Paws scores lower than the other models. We observe that it often fails by calling tools that don't exist, asking the user for further clarification, or trying to solve the task over multiple turns. For single-turn tasks, we recommend using the other models.

Training Procedure

The training procedure mostly follows that of CAT-Thinking-8B, with some modifications for agentic capability.

We generate a teacher dataset using gpt-oss-120b as a reference. The dataset consists of math, coding (Python, shell script), tool calling, and general instruction-following tasks. Since the reasoning traces are in English, we translate them into Japanese using CAT-Translate-7b and gpt-oss-20b. We then train the Swallow model on the synthesized dataset with full-parameter SFT.

Then, we run GRPO with a permissive reward model that gives partial rewards for being able to (1) follow the reasoning format, (2) generate the reasoning trace and the main text in Japanese, and (3) answer the question in the instructed format. In this way, the model learns to follow the reasoning format and generate its reasoning trace in Japanese. Since this training phase focuses on learning the superficial format rather than reasoning competence itself, we use LoRA.
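
The idea of a permissive reward with partial credit can be sketched as below. This is not the reward model used in training, whose details are not given here. The equal one-third weights, the character-range test for Japanese, the 0.5 ratio threshold, and the `answer_pattern` argument are all assumptions made for illustration.

```python
import re

# Hiragana, katakana, and common kanji ranges (assumed proxy for "is Japanese").
JAPANESE_CHAR = re.compile(r"[\u3040-\u30ff\u4e00-\u9fff]")
RESPONSE_FORMAT = re.compile(r"\A<think>(.*?)</think>(.*)\Z", re.DOTALL)


def japanese_ratio(text):
    """Fraction of non-whitespace characters that fall in the Japanese ranges."""
    chars = [c for c in text if not c.isspace()]
    if not chars:
        return 0.0
    return sum(bool(JAPANESE_CHAR.match(c)) for c in chars) / len(chars)


def permissive_reward(response, answer_pattern, min_japanese_ratio=0.5):
    """Give one third of the reward for each criterion that is satisfied."""
    match = RESPONSE_FORMAT.match(response.strip())
    if match is None:
        return 0.0  # without the reasoning format the other checks cannot be applied
    reasoning, answer = match.groups()
    passed = 1  # (1) reasoning format is followed
    if min(japanese_ratio(reasoning), japanese_ratio(answer)) >= min_japanese_ratio:
        passed += 1  # (2) reasoning trace and main text are in Japanese
    if re.search(answer_pattern, answer):
        passed += 1  # (3) answer matches the instructed format
    return passed / 3
```

Because each criterion earns credit on its own, a response that only gets the format right still receives a non-zero reward.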

Finally, we train the model with GRPO using a strict reward model that gives a reward only if the model follows all format constraints and also generates the correct answer. During the GRPO steps, we include multi-turn tasks in the coding, shell script, and tool calling domains.
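
The strict reward is all-or-nothing, in contrast to partial credit. A minimal sketch of that logic follows. The actual format constraints and answer checkers are not specified here, so `format_checks`, `has_reasoning_block`, and `last_number` are hypothetical stand-ins.

```python
import re


def strict_reward(response, format_checks, extract_answer, reference):
    """Return 1.0 only if every format check passes and the answer is correct."""
    if not all(check(response) for check in format_checks):
        return 0.0
    answer = extract_answer(response)
    return 1.0 if answer is not None and answer == reference else 0.0


# Hypothetical checks, for illustration only.
def has_reasoning_block(response):
    """Check that the response opens with a closed <think> block."""
    return response.startswith("<think>") and "</think>" in response


def last_number(response):
    """Return the last integer after the reasoning block, or None if there is none."""
    numbers = re.findall(r"-?\d+", response.split("</think>")[-1])
    return int(numbers[-1]) if numbers else None
```

For example, `strict_reward(response, [has_reasoning_block], last_number, 72)` gives a reward only when the response has a reasoning block and its final number is 72.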