Commit 0209b4e · 1 Parent(s): dcddc1a
Slight readme update
README.md CHANGED
@@ -26,7 +26,7 @@ model-index:
 type: loss
 value: 0.2869
 ---
-# DalaT5
+# DalaT5 - T5 Fine-Tuned on Cyrillic-to-Latin Kazakh 🇰🇿
 
 > 'Dala' means 'steppe' in Kazakh - a nod to where the voice of this model might echo.
 
@@ -101,7 +101,7 @@ print(output)
 
 Тәуелсіз жоба болғанына қарамастан, DalaT5 өте маңызды үш деректер жиынтығын пайдаланады / Despite being an independent project, DalaT5 makes use of three very important datasets:
 
-- The first ~
+- The first ~2 million records of the Kazakh subset of the CC100 dataset by [Conneau et al. (2020)](https://paperswithcode.com/paper/unsupervised-cross-lingual-representation-1)
 - The raw, Kazakh-focused part of the [Kazakh Parallel Corpus (KazParC)](https://huggingface.co/datasets/issai/kazparc) from Nazarbayev University's Institute of Smart Systems and Artificial Intelligence (ISSAI), graciously made available on Hugging Face
 - The Wikipedia dump of articles in the Kazakh language, obtained via the `wikiextractor` Python package
 