ASR-1
ASR-1 is a Vietnamese automatic speech recognition model developed by UnderTheSea NLP.
Model Description
- Model Type: Fine-tuned Whisper for Vietnamese ASR
- Base Model: openai/whisper-large-v3
- Language: Vietnamese
- License: Apache 2.0
- Task: Automatic Speech Recognition (Speech-to-Text)
Installation
pip install transformers torch torchaudio datasets
Usage
from transformers import WhisperProcessor, WhisperForConditionalGeneration
from huggingface_hub import snapshot_download
import torchaudio
# Download the model files from the Hugging Face Hub (cached after the first call)
model_path = snapshot_download('undertheseanlp/asr-1')
processor = WhisperProcessor.from_pretrained(model_path)
model = WhisperForConditionalGeneration.from_pretrained(model_path)
# Load the audio and resample it to the 16 kHz rate Whisper expects
waveform, sample_rate = torchaudio.load("audio.wav")
if sample_rate != 16000:
    waveform = torchaudio.transforms.Resample(sample_rate, 16000)(waveform)
# Average the channels so that stereo files also yield a 1-D mono signal
input_features = processor(
    waveform.mean(dim=0).numpy(),
    sampling_rate=16000,
    return_tensors="pt"
).input_features
# Generate token IDs, then decode them to text
predicted_ids = model.generate(input_features, language="vi", task="transcribe")
transcription = processor.batch_decode(predicted_ids, skip_special_tokens=True)[0]
print(transcription)
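By default, Whisper's feature extractor pads or truncates each input to a 30-second window, so the usage code above would transcribe only the start of a longer recording. One possible workaround, sketched here under the assumption that plain non-overlapping windows are acceptable, is to compute sample ranges for each 30 s window and run the processor and `model.generate` on each slice. The helper name `chunk_bounds` is hypothetical and is not part of the ASR-1 repository.

```python
def chunk_bounds(num_samples, sample_rate=16000, chunk_seconds=30):
    """Return (start, end) sample indices that cover the audio in fixed-size windows."""
    if num_samples < 0 or sample_rate <= 0 or chunk_seconds <= 0:
        raise ValueError("num_samples must be >= 0; sample_rate and chunk_seconds must be > 0")
    chunk_size = sample_rate * chunk_seconds
    # The last window is clipped to the end of the audio
    return [
        (start, min(start + chunk_size, num_samples))
        for start in range(0, num_samples, chunk_size)
    ]


# Example: a 70-second recording at 16 kHz splits into 30 s + 30 s + 10 s
bounds = chunk_bounds(70 * 16000)
print(bounds)  # [(0, 480000), (480000, 960000), (960000, 1120000)]
```

Each `(start, end)` pair can slice the 16 kHz waveform along its time axis before it is passed to the processor, and the decoded strings can then be joined. Fixed boundaries may cut a word in half, so overlapping windows or splitting on silence may give better results.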
API (compatible with underthesea)
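from asr import AsrTranscriber, transcribe
# One-off transcription with the module-level helper
text = transcribe("audio.wav")
print(text)
# Load a local checkpoint once and reuse it for several files
transcriber = AsrTranscriber.load("models/asr-1")
result = transcriber.transcribe("audio.wav")
print(result.text)
print(result.confidence)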
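When transcribing many files, the result object shown above could be used to flag uncertain outputs for review. The following is a minimal sketch, assuming that `result.confidence` is a number where higher means more certain (its range is not documented here) and that `transcriber.transcribe(path)` behaves as in the example. The function `transcribe_many`, the `min_confidence` threshold, and the stub classes used to make the sketch runnable are all hypothetical.

```python
from dataclasses import dataclass


def transcribe_many(transcriber, paths, min_confidence=0.5):
    """Transcribe each path and split the results by a confidence threshold."""
    accepted, flagged = [], []
    for path in paths:
        result = transcriber.transcribe(path)
        entry = (path, result.text, result.confidence)
        if result.confidence >= min_confidence:
            accepted.append(entry)
        else:
            flagged.append(entry)
    return accepted, flagged


# Stand-ins so the sketch runs without the model (hypothetical);
# in practice, pass the object returned by AsrTranscriber.load(...)
@dataclass
class StubResult:
    text: str
    confidence: float


class StubTranscriber:
    def __init__(self, canned):
        self.canned = canned

    def transcribe(self, path):
        return self.canned[path]


stub = StubTranscriber({
    "a.wav": StubResult("xin chào", 0.92),
    "b.wav": StubResult("cảm ơn", 0.31),
})
accepted, flagged = transcribe_many(stub, ["a.wav", "b.wav"])
print(accepted)
print(flagged)
```

The threshold of 0.5 is arbitrary; a suitable value depends on how the confidence score is actually computed.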
Training
uv run src/train.py
uv run src/train.py --base-model openai/whisper-large-v3 --dataset common_voice
uv run src/train.py --wandb --wandb-project asr-1
Evaluation
uv run src/evaluate.py --model models/asr-1
uv run src/evaluate.py --model models/asr-1 --dataset vivos
Datasets
| Dataset | Split | Hours | Samples |
|---|---|---|---|
| Common Voice 17.0 (vi) | train | ~30h | ~25,000 |
| Common Voice 17.0 (vi) | test | ~5h | ~5,000 |
| VIVOS | train | 15h | 11,660 |
| VIVOS | test | 0.6h | 760 |
Metrics
- WER (Word Error Rate): Lower is better
- CER (Character Error Rate): Lower is better
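Both metrics are an edit distance divided by the reference length: WER counts substitutions, deletions, and insertions over words, and CER counts them over characters. The following is a minimal, dependency-free sketch of that definition. It is not the code in `src/evaluate.py`, and the function names and example sentences are illustrative. It applies Unicode NFC normalization (so that Vietnamese diacritics compare consistently) but no casing or punctuation normalization, which a real evaluation would likely add.

```python
import unicodedata


def edit_distance(ref, hyp):
    """Levenshtein distance between two sequences (each edit costs 1)."""
    prev = list(range(len(hyp) + 1))
    for i, r in enumerate(ref, start=1):
        curr = [i]
        for j, h in enumerate(hyp, start=1):
            curr.append(min(
                prev[j] + 1,             # deletion
                curr[j - 1] + 1,         # insertion
                prev[j - 1] + (r != h),  # substitution (or match, cost 0)
            ))
        prev = curr
    return prev[-1]


def _nfc(text):
    return unicodedata.normalize("NFC", text)


def wer(reference, hypothesis):
    """Word error rate: word-level edits / number of reference words."""
    ref_words = _nfc(reference).split()
    if not ref_words:
        raise ValueError("reference must contain at least one word")
    return edit_distance(ref_words, _nfc(hypothesis).split()) / len(ref_words)


def cer(reference, hypothesis):
    """Character error rate: character-level edits / reference length."""
    ref_chars = _nfc(reference)
    if not ref_chars:
        raise ValueError("reference must not be empty")
    return edit_distance(ref_chars, _nfc(hypothesis)) / len(ref_chars)


reference = "xin chào các bạn"
hypothesis = "xin chào cac bạn"   # one missing tone mark
print(wer(reference, hypothesis))  # 0.25   (1 of 4 words wrong)
print(cer(reference, hypothesis))  # 0.0625 (1 of 16 characters wrong)
```

The example shows why the two metrics are reported together: a single missing diacritic costs a whole word under WER but only one character under CER.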
References
Citation
@article{radford2022whisper,
title={Robust Speech Recognition via Large-Scale Weak Supervision},
author={Radford, Alec and Kim, Jong Wook and Xu, Tao and Brockman, Greg and McLeavey, Christine and Sutskever, Ilya},
journal={arXiv preprint arXiv:2212.04356},
year={2022}
}
Technical Report
See TECHNICAL_REPORT.md for detailed methodology and evaluation.