# Bangla Whisper Large V3 - Bengali Speech Recognition Model
This model is a fine-tuned version of openai/whisper-large-v3 for Bengali (Bangla) speech recognition.
## Model Description
- Base Model: Whisper Large V3
- Language: Bengali (bn)
- Task: Automatic Speech Recognition (Transcription)
- Fine-tuning Method: LoRA (Low-Rank Adaptation) with Unsloth
- Training Data: 3182 samples from 20 Bengali regional dialects
## Training Details

### Training Data
- Total Samples: 3350
- Training Samples: 3182
- Validation Samples: 168
- Regions: 20 different Bengali-speaking regions (Dhaka, Chittagong, Sylhet, Rajshahi, Khulna, etc.)
### Training Hyperparameters
- Epochs: 1
- Batch Size: 8
- Gradient Accumulation: 2
- Effective Batch Size: 16
- Learning Rate: 0.0001
- Optimizer: AdamW (`adamw_torch`)
- LoRA Rank: 32
- LoRA Alpha: 32
- Target Modules: q_proj, v_proj, k_proj, o_proj, encoder_attn layers, fc1, fc2
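The LoRA settings above can be expressed as a `peft` configuration. This is an illustrative sketch mirroring the listed hyperparameters, not the exact training script; the dropout value and precise module list are assumptions.

```python
from peft import LoraConfig

# Illustrative LoRA configuration mirroring the hyperparameters above.
# The exact target-module list used in training may differ slightly.
lora_config = LoraConfig(
    r=32,            # LoRA rank
    lora_alpha=32,   # scaling factor
    target_modules=[
        "q_proj", "k_proj", "v_proj", "o_proj",  # attention projections
        "fc1", "fc2",                            # feed-forward layers
    ],
    lora_dropout=0.0,  # assumed; not stated in this card
    bias="none",
)
```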
### Training Results
- Training Time: 41.20 minutes
- Final Training Loss: 0.3332
- Speed: 1.29 samples/second
- GPU: Tesla T4
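The reported throughput is consistent with the sample count and wall-clock time above; a quick sanity check:

```python
# Throughput check: 3182 training samples over 41.20 minutes
samples = 3182
minutes = 41.20

throughput = samples / (minutes * 60)  # samples per second
print(round(throughput, 2))  # → 1.29
```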
## Usage

### Installation

```bash
pip install transformers librosa soundfile torch
```

### Basic Usage
```python
import torch
import librosa
import numpy as np
from transformers import WhisperForConditionalGeneration, WhisperProcessor

# Load model and processor
model = WhisperForConditionalGeneration.from_pretrained(
    "seyam2023/bangla-whisper-large-v3",
    torch_dtype=torch.float16,
    device_map="auto",
)
processor = WhisperProcessor.from_pretrained("seyam2023/bangla-whisper-large-v3")

# Load audio at 16 kHz mono and peak-normalize
audio, _ = librosa.load("your_audio.wav", sr=16000, mono=True)
audio = audio / (np.max(np.abs(audio)) + 1e-8)

# Extract log-mel features and transcribe
inputs = processor.feature_extractor(audio, sampling_rate=16000, return_tensors="pt")
input_features = inputs.input_features.to(model.device, dtype=torch.float16)

with torch.no_grad():
    pred_ids = model.generate(
        input_features,
        max_new_tokens=128,
        num_beams=1,
        do_sample=False,
    )

transcription = processor.batch_decode(pred_ids, skip_special_tokens=True)[0]
print(transcription)
```
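The peak-normalization step above can be factored into a small reusable helper (the function name is illustrative, not part of the model's API):

```python
import numpy as np

def peak_normalize(audio: np.ndarray, eps: float = 1e-8) -> np.ndarray:
    """Scale a waveform so its peak amplitude is ~1.0.

    The small eps guards against division by zero on silent clips.
    """
    return audio / (np.max(np.abs(audio)) + eps)

# Silent input stays silent instead of raising a divide-by-zero warning
print(peak_normalize(np.zeros(4)).max())  # → 0.0
```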
### Batch Processing

```python
# Transcribe multiple audio files one at a time
audio_files = ["file1.wav", "file2.wav", "file3.wav"]

for audio_file in audio_files:
    audio, _ = librosa.load(audio_file, sr=16000, mono=True)
    audio = audio / (np.max(np.abs(audio)) + 1e-8)
    inputs = processor.feature_extractor(audio, sampling_rate=16000, return_tensors="pt")
    input_features = inputs.input_features.to(model.device, dtype=torch.float16)
    with torch.no_grad():
        pred_ids = model.generate(input_features, max_new_tokens=128)
    text = processor.batch_decode(pred_ids, skip_special_tokens=True)[0]
    print(f"{audio_file}: {text}")
```
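Whisper consumes audio in 30-second windows, so recordings longer than that should be split before the loop above. A minimal chunker (illustrative helper, assuming simple non-overlapping windows):

```python
import numpy as np

def chunk_audio(audio: np.ndarray, sr: int = 16000, chunk_seconds: float = 30.0):
    """Split a waveform into consecutive chunks of at most chunk_seconds each."""
    step = int(sr * chunk_seconds)
    return [audio[i:i + step] for i in range(0, len(audio), step)]

# A 65-second clip at 16 kHz splits into 30 s + 30 s + 5 s
chunks = chunk_audio(np.zeros(65 * 16000))
print([len(c) / 16000 for c in chunks])  # → [30.0, 30.0, 5.0]
```

Each chunk can then be passed through the same feature-extraction and `generate` steps, concatenating the resulting transcriptions.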
## Performance

This model was trained on diverse Bengali audio data from multiple regional dialects, making it robust across:
- Standard Bengali (Dhaka dialect)
- Regional variations (Chittagong, Sylhet, Noakhali, etc.)
- Various audio quality conditions
- Different speaking styles and speeds
## Limitations

- Optimized for the Bengali language only
- Performance may vary with:
  - Heavy background noise
  - Very low-quality audio recordings
  - Non-native Bengali speakers
  - Mixed-language speech (code-switching)
## Training Infrastructure
- Hardware: Tesla T4
- Framework: PyTorch, Transformers, Unsloth
- Precision: FP16 (Mixed Precision Training)
## Citation

If you use this model, please cite:

```bibtex
@misc{bangla-whisper-large-v3,
  author = {Touhidul Alam Seyam and Md Abtahee Kabir and Noore Tamanna Orny},
  title = {Bangla Whisper Large V3: Bengali Speech Recognition},
  year = {2025},
  publisher = {Hugging Face},
  howpublished = {\url{https://huggingface.co/seyam2023/bangla-whisper-large-v3}}
}
```
## Team
Team Huntrix
- Touhidul Alam Seyam
- Md Abtahee Kabir
- Noore Tamanna Orny
## Acknowledgments
- Base model: OpenAI Whisper Large V3
- Training optimization: Unsloth
- Dataset: Custom Bengali regional audio dataset
## License
Apache 2.0
## Contact
For questions or feedback, please open an issue in the model repository or contact Team Huntrix.