Synthetic Translation Dataset Generator
A modular, type-hinted Python implementation for generating synthetic translation datasets using LLMs.
Project Structure
synthetic_projects/
├── src/
│   ├── core/                  # Shared core utilities
│   │   ├── config.py          # Configuration management
│   │   ├── models.py          # Data models
│   │   ├── llm_client.py      # LLM API client
│   │   └── worker_pool.py     # Multiprocessing worker pool
│   ├── asr_translation/       # ASR -> English translation
│   │   ├── prompts.py         # Translation prompts
│   │   ├── models.py          # Data models
│   │   ├── processor.py       # Data processing logic
│   │   └── runner.py          # Main runner script
│   └── chat_translation/      # Chat -> English translation + moderation
│       ├── prompts.py         # Translation & moderation prompts
│       ├── models.py          # Data models
│       ├── processor.py       # Data processing logic
│       └── runner.py          # Main runner script
├── tests/                     # Unit tests
├── scripts/                   # Background execution scripts
└── configs/                   # Configuration files
Features
- Clean Architecture: Separation of concerns with modular design
- Type-Safe: Full type hints with Pydantic models
- Configurable: YAML/ENV-based configuration
- Efficient: Multi-CPU processing with dynamic batching
- Resilient: Retry logic and error handling
- Testable: Comprehensive test coverage
- Maintainable: Well-documented, easy to extend
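The batching and retry features listed above can be sketched roughly as follows. This is a minimal illustration, not the code in `src/core/`: the helper names `make_batches` and `with_retry` are hypothetical, the batches are fixed-size (a simpler stand-in for the dynamic batching the project describes), and the backoff schedule is an assumption.

```python
import time
from typing import Callable, Iterator, Sequence, TypeVar

T = TypeVar("T")
R = TypeVar("R")


def make_batches(items: Sequence[T], batch_size: int) -> Iterator[Sequence[T]]:
    """Yield consecutive slices of at most batch_size items (hypothetical helper)."""
    if batch_size < 1:
        raise ValueError("batch_size must be >= 1")
    for start in range(0, len(items), batch_size):
        yield items[start:start + batch_size]


def with_retry(fn: Callable[[], R], max_attempts: int = 3, base_delay: float = 0.5) -> R:
    """Call fn, retrying on any exception with exponential backoff (hypothetical helper)."""
    if max_attempts < 1:
        raise ValueError("max_attempts must be >= 1")
    for attempt in range(1, max_attempts + 1):
        try:
            return fn()
        except Exception:
            # Re-raise once the final attempt has failed.
            if attempt == max_attempts:
                raise
            # Wait base_delay, 2*base_delay, 4*base_delay, ... between attempts.
            time.sleep(base_delay * 2 ** (attempt - 1))
    raise RuntimeError("unreachable")
```

In a setup like this, each worker would wrap its LLM request for one batch in `with_retry`, so a transient API failure costs one batch retry rather than the whole run.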
Sub-Projects
1. ASR Translation
Translates Vietnamese ASR transcriptions to well-written English text.
Input: Raw ASR transcriptions (Vietnamese, unnormalized)
Output: Clean, well-organized English translations
2. Chat Translation
Translates Vietnamese chat messages to formal English with content moderation.
Input: Vietnamese chat messages
Output:
- Formal English translation
- Political-compliance metadata (with respect to Vietnamese law)
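One possible shape for a single output record, written as one JSON line, is sketched below. The field names (`source_text`, `translation`, `is_compliant`, `flagged_reasons`) are assumptions for illustration and are not taken from the project's `models.py`. The project states that it uses Pydantic models; a standard-library dataclass is used here only to keep the sketch dependency-free.

```python
import json
from dataclasses import asdict, dataclass, field
from typing import List


@dataclass
class ChatTranslationRecord:
    """Hypothetical output record: translation plus moderation metadata."""

    source_text: str          # original Vietnamese chat message
    translation: str          # formal English translation
    is_compliant: bool        # assumed moderation verdict
    flagged_reasons: List[str] = field(default_factory=list)

    def to_jsonl(self) -> str:
        """Serialize to a single JSON line, keeping Vietnamese characters readable."""
        return json.dumps(asdict(self), ensure_ascii=False)


record = ChatTranslationRecord(
    source_text="xin chào",
    translation="Hello.",
    is_compliant=True,
)
line = record.to_jsonl()
```

Passing `ensure_ascii=False` keeps diacritics such as "à" as literal characters in the `.jsonl` file instead of `\u` escapes, which makes the output easier to inspect by hand.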
Quick Start
Installation
pip install -r requirements.txt
Configuration
Create a .env file:
VLLM_API_BASE=http://localhost:8000/v1
VLLM_MODEL=Qwen/Qwen3-Next-80B-A3B-Instruct
MAX_WORKERS=8
BATCH_SIZE=32
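As a rough sketch of how these variables might be read into a typed settings object: the `Settings` class and `load_settings` function below are hypothetical and may differ from the project's `config.py`. The sketch assumes the variables are already present in the process environment (for example, exported by the shell or loaded from `.env` by a separate tool) and reuses the example values above as defaults, which is also an assumption.

```python
import os
from dataclasses import dataclass
from typing import Mapping


@dataclass(frozen=True)
class Settings:
    """Hypothetical typed view of the environment variables shown above."""

    vllm_api_base: str
    vllm_model: str
    max_workers: int
    batch_size: int


def load_settings(env: Mapping[str, str] = os.environ) -> Settings:
    """Build Settings from a mapping of environment variables.

    VLLM_MODEL is treated as required (KeyError if missing); the other
    defaults mirror the example .env values and are assumptions.
    """
    return Settings(
        vllm_api_base=env.get("VLLM_API_BASE", "http://localhost:8000/v1"),
        vllm_model=env["VLLM_MODEL"],
        max_workers=int(env.get("MAX_WORKERS", "8")),
        batch_size=int(env.get("BATCH_SIZE", "32")),
    )
```

Taking the mapping as a parameter makes the loader easy to test with a plain dict, without touching the real environment.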
Run ASR Translation
python -m src.asr_translation.runner \
--input translation_for_asr/telephone2000h.txt \
--output outputs/asr_translated.jsonl \
--num-workers 8
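For orientation, the flags in the command above could be declared with `argparse` along these lines. This is a sketch only: the actual `runner.py` may define more options or use different defaults, and the `build_parser` and `parse_args` names and the default of 8 workers are assumptions.

```python
import argparse
from typing import List, Optional


def build_parser() -> argparse.ArgumentParser:
    """Hypothetical parser mirroring the three flags used in the command above."""
    parser = argparse.ArgumentParser(description="ASR translation runner (sketch)")
    parser.add_argument("--input", required=True, help="Path to the raw ASR transcription file")
    parser.add_argument("--output", required=True, help="Path of the JSONL file to write")
    parser.add_argument("--num-workers", type=int, default=8, help="Number of worker processes")
    return parser


def parse_args(argv: Optional[List[str]] = None) -> argparse.Namespace:
    """Parse argv, or sys.argv[1:] when argv is None."""
    return build_parser().parse_args(argv)
```

Note that `argparse` exposes `--num-workers` as the attribute `num_workers` on the returned namespace.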
Run Chat Translation
python -m src.chat_translation.runner \
--dataset tarudesu/VOZ-HSD \
--output outputs/chat_translated.jsonl \
--num-workers 8
Background Execution
# Create the log directory first; the redirects below fail if it is missing
mkdir -p logs
# ASR translation in background
nohup bash scripts/run_asr_translation.sh > logs/asr.log 2>&1 &
# Chat translation in background
nohup bash scripts/run_chat_translation.sh > logs/chat.log 2>&1 &
Testing
# Run all tests
pytest tests/ -v
# Run with coverage
pytest tests/ --cov=src --cov-report=html
License
MIT