
Synthetic Translation Dataset Generator

A modular implementation for generating synthetic translation datasets using LLMs, built as a shared core plus two task-specific sub-projects.

Project Structure

synthetic_projects/
├── src/
│   ├── core/               # Shared core utilities
│   │   ├── config.py       # Configuration management
│   │   ├── models.py       # Data models
│   │   ├── llm_client.py   # LLM API client
│   │   └── worker_pool.py  # Multiprocessing worker pool
│   ├── asr_translation/    # ASR -> English translation
│   │   ├── prompts.py      # Translation prompts
│   │   ├── models.py       # Data models
│   │   ├── processor.py    # Data processing logic
│   │   └── runner.py       # Main runner script
│   └── chat_translation/   # Chat -> English translation + moderation
│       ├── prompts.py      # Translation & moderation prompts
│       ├── models.py       # Data models
│       ├── processor.py    # Data processing logic
│       └── runner.py       # Main runner script
├── tests/                  # Unit tests
├── scripts/                # Background execution scripts
└── configs/                # Configuration files

Features

  • Clean Architecture: Separation of concerns with modular design
  • Type-Safe: Full type hints with Pydantic models
  • Configurable: YAML/ENV-based configuration
  • Efficient: Multi-CPU processing with dynamic batching
  • Resilient: Retry logic and error handling
  • Testable: Comprehensive test coverage
  • Maintainable: Well-documented, easy to extend
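
The retry behaviour listed under "Resilient" is not described further in this README. As a minimal sketch of what such logic commonly looks like, assuming exponential backoff with a small jitter (the name call_with_retry and its parameters are hypothetical and not taken from src/core/llm_client.py):

```python
import random
import time
from typing import Callable, TypeVar

T = TypeVar("T")


def call_with_retry(
    fn: Callable[[], T],
    max_attempts: int = 3,
    base_delay: float = 1.0,
    sleep: Callable[[float], None] = time.sleep,
) -> T:
    """Call fn, retrying on any exception with exponential backoff and jitter.

    Hypothetical helper for illustration; the project's own retry policy
    (which exceptions, how many attempts) may differ.
    """
    for attempt in range(1, max_attempts + 1):
        try:
            return fn()
        except Exception:
            if attempt == max_attempts:
                raise  # out of attempts: surface the last error
            # Delay doubles on each failed attempt: base, 2*base, 4*base, ...
            delay = base_delay * 2 ** (attempt - 1)
            # Up to 10% jitter so parallel workers do not retry in lockstep
            sleep(delay + random.uniform(0, delay * 0.1))
    raise ValueError("max_attempts must be at least 1")
```

The sleep parameter is injectable so the backoff schedule can be unit-tested without real waiting. In practice the except clause would likely be narrowed to transient errors such as timeouts or rate limits.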

Sub-Projects

1. ASR Translation

Translates Vietnamese ASR transcriptions to well-written English text.

Input: Raw ASR transcriptions (Vietnamese, unnormalized)
Output: Clean, well-organized English translations

2. Chat Translation

Translates Vietnamese chat messages to formal English with content moderation.

Input: Vietnamese chat messages
Output:

  • Formal English translation
  • Political compliance metadata (under Vietnamese law)
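
The README does not specify the output schema. As a rough sketch of how one chat output record could be shaped and written as a JSONL line, assuming one record per message (every field name below is hypothetical and is not taken from src/chat_translation/models.py, which per the feature list is likely Pydantic-based rather than a dataclass):

```python
import json
from dataclasses import asdict, dataclass


@dataclass
class ChatTranslationRecord:
    # All field names are hypothetical, chosen for illustration only.
    source_text: str            # original Vietnamese chat message
    translation: str            # formal English translation
    is_compliant: bool          # outcome of the moderation check
    compliance_notes: str = ""  # optional free-text explanation


def to_jsonl_line(record: ChatTranslationRecord) -> str:
    """Serialize one record as a single JSONL line.

    ensure_ascii=False keeps Vietnamese diacritics readable in the file
    instead of escaping them as \\uXXXX sequences.
    """
    return json.dumps(asdict(record), ensure_ascii=False)
```

Usage: writing to_jsonl_line(record) followed by a newline for each record yields a file in the same .jsonl format that the --output paths in this README point to.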

Quick Start

Installation

pip install -r requirements.txt

Configuration

Create a .env file in the project root:

VLLM_API_BASE=http://localhost:8000/v1
VLLM_MODEL=Qwen/Qwen3-Next-80B-A3B-Instruct
MAX_WORKERS=8
BATCH_SIZE=32
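
For orientation, here is a minimal sketch of how these four variables could be read into a typed settings object. It uses only the standard library and assumes the variables are already present in the process environment (for example, exported by the shell or loaded from .env by a tool such as python-dotenv). The names Settings and load_settings are hypothetical; the project's actual src/core/config.py may work differently.

```python
import os
from dataclasses import dataclass
from typing import Mapping


@dataclass(frozen=True)
class Settings:
    api_base: str
    model: str
    max_workers: int
    batch_size: int


def load_settings(env: Mapping[str, str] = os.environ) -> Settings:
    """Build Settings from environment variables.

    Defaults mirror the example .env values in this README; whether the
    project itself falls back to these values is an assumption.
    """
    settings = Settings(
        api_base=env.get("VLLM_API_BASE", "http://localhost:8000/v1"),
        model=env.get("VLLM_MODEL", "Qwen/Qwen3-Next-80B-A3B-Instruct"),
        max_workers=int(env.get("MAX_WORKERS", "8")),
        batch_size=int(env.get("BATCH_SIZE", "32")),
    )
    # Fail early on values that would stall or crash the worker pool
    if settings.max_workers < 1 or settings.batch_size < 1:
        raise ValueError("MAX_WORKERS and BATCH_SIZE must be positive integers")
    return settings
```

Passing a plain dict as env makes the loader easy to exercise in tests without touching the real environment.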

Run ASR Translation

python -m src.asr_translation.runner \
    --input translation_for_asr/telephone2000h.txt \
    --output outputs/asr_translated.jsonl \
    --num-workers 8
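
The --num-workers flag and the BATCH_SIZE setting suggest that input lines are grouped into batches before being handed to worker processes. The sketch below shows plain fixed-size batching as a simpler stand-in for the "dynamic batching" mentioned under Features, whose exact policy this README does not describe. The function name batched is hypothetical.

```python
from itertools import islice
from typing import Iterable, Iterator, List


def batched(items: Iterable[str], batch_size: int) -> Iterator[List[str]]:
    """Yield consecutive lists of up to batch_size items.

    Works lazily on any iterable, so a large input file can be streamed
    line by line instead of being loaded into memory at once.
    """
    if batch_size < 1:
        raise ValueError("batch_size must be at least 1")
    it = iter(items)
    while True:
        batch = list(islice(it, batch_size))
        if not batch:
            return  # input exhausted
        yield batch
```

Each yielded batch could then be submitted to a pool worker as one unit of work; the last batch may be shorter than batch_size.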

Run Chat Translation

python -m src.chat_translation.runner \
    --dataset tarudesu/VOZ-HSD \
    --output outputs/chat_translated.jsonl \
    --num-workers 8

Background Execution

# Create the log directory first; the redirects below fail if it is missing
mkdir -p logs

# ASR translation in background
nohup bash scripts/run_asr_translation.sh > logs/asr.log 2>&1 &

# Chat translation in background
nohup bash scripts/run_chat_translation.sh > logs/chat.log 2>&1 &

Testing

# Run all tests
pytest tests/ -v

# Run with coverage
pytest tests/ --cov=src --cov-report=html

License

MIT
