---
library_name: transformers
datasets:
- oscar
- mc4
- rasyosef/amharic-sentences-corpus
language:
- am
metrics:
- perplexity
pipeline_tag: fill-mask
widget:
- text: ከሀገራቸው ከኢትዮጵያ ከወጡ ግማሽ ምዕተ <mask> ተቆጥሯል።
  example_title: Example 1
- text: ባለፉት አምስት ዓመታት የአውሮጳ ሀገራት የጦር <mask> ግዢ በእጅጉ ጨምሯል።
  example_title: Example 2
- text: ኬንያውያን ከዳር እስከዳር በአንድ ቆመው የተቃውሞ ድምጻቸውን ማሰማታቸውን ተከትሎ የዜጎችን ቁጣ የቀሰቀሰው የቀረጥ ጭማሪ ሕግ ትናንት በፕሬዝደንት ዊልያም ሩቶ <mask> ቢደረግም ዛሬም ግን የተቃውሞው እንቅስቃሴ መቀጠሉ እየተነገረ ነው።
  example_title: Example 3
- text: ተማሪዎቹ በውድድሩ ካሸነፉበት የፈጠራ ስራ መካከል <mask> እና ቅዝቃዜን እንደአየር ሁኔታው የሚያስተካክል ጃኬት አንዱ ነው።
  example_title: Example 4
---

# roberta-base-amharic

This model has the same architecture as [xlm-roberta-base](https://huggingface.co/FacebookAI/xlm-roberta-base) and was pretrained from scratch on the Amharic subsets of the [oscar](https://huggingface.co/datasets/oscar), [mc4](https://huggingface.co/datasets/mc4), and [amharic-sentences-corpus](https://huggingface.co/datasets/rasyosef/amharic-sentences-corpus) datasets, for a total of **290 million tokens**. The tokenizer was trained from scratch on the same text corpus and has a vocabulary size of 32k.

The model was trained for **22 hours** on an **A100 40GB GPU**.

It achieves the following results on the evaluation set:

- `Loss: 2.09`
- `Perplexity: 8.08`

This model has **110 million parameters** and is currently the **best-performing** Amharic encoder model in the comparison below, beating the roughly 2.5x larger, `279 million` parameter [xlm-roberta-base](https://huggingface.co/FacebookAI/xlm-roberta-base) multilingual model on Amharic Sentiment Classification and Named Entity Recognition tasks.

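
As a quick sanity check on the figures above, the sketch below assumes the reported perplexity is the exponential of the mean evaluation loss. This is the usual convention for language models, but the card does not state the formula, so treat it as an assumption. It also recomputes the size ratio against xlm-roberta-base. The function name `perplexity_from_loss` is illustrative only.

```python
import math


def perplexity_from_loss(loss: float) -> float:
    """Perplexity, assuming perplexity = exp(mean cross-entropy loss)."""
    return math.exp(loss)


# Values reported in this model card.
eval_loss = 2.09
roberta_base_amharic_params = 110e6
xlm_roberta_base_params = 279e6

print(f"perplexity ~ {perplexity_from_loss(eval_loss):.2f}")
print(f"size ratio ~ {xlm_roberta_base_params / roberta_base_amharic_params:.1f}x")
```

Under that assumption the loss of 2.09 corresponds to a perplexity of about 8.08, and the size ratio comes out at about 2.5x, both consistent with the numbers reported here.
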
# How to use

You can use this model directly with a pipeline for masked language modeling:

```python
>>> from transformers import pipeline
>>> unmasker = pipeline('fill-mask', model='rasyosef/roberta-base-amharic')
>>> unmasker("ከሀገራቸው ከኢትዮጵያ ከወጡ ግማሽ ምዕተ <mask> ተቆጥሯል።")

[{'score': 0.40162667632102966,
  'token': 137,
  'token_str': 'ዓመት',
  'sequence': 'ከሀገራቸው ከኢትዮጵያ ከወጡ ግማሽ ምዕተ ዓመት ተቆጥሯል።'},
 {'score': 0.24096301198005676,
  'token': 346,
  'token_str': 'አመት',
  'sequence': 'ከሀገራቸው ከኢትዮጵያ ከወጡ ግማሽ ምዕተ አመት ተቆጥሯል።'},
 {'score': 0.15971705317497253,
  'token': 217,
  'token_str': 'ዓመታት',
  'sequence': 'ከሀገራቸው ከኢትዮጵያ ከወጡ ግማሽ ምዕተ ዓመታት ተቆጥሯል።'},
 {'score': 0.13074122369289398,
  'token': 733,
  'token_str': 'አመታት',
  'sequence': 'ከሀገራቸው ከኢትዮጵያ ከወጡ ግማሽ ምዕተ አመታት ተቆጥሯል።'},
 {'score': 0.03847867250442505,
  'token': 194,
  'token_str': 'ዘመን',
  'sequence': 'ከሀገራቸው ከኢትዮጵያ ከወጡ ግማሽ ምዕተ ዘመን ተቆጥሯል።'}]
```

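
The model can also be loaded without the pipeline helper, using `AutoTokenizer` and `AutoModelForMaskedLM`. The following is a minimal sketch of that route, not a reference implementation: the function name `fill_mask_direct`, the single-mask restriction, and the assumption that the tokenizer's mask token is the literal `<mask>` used in the examples here are all illustrative choices.

```python
MODEL_ID = "rasyosef/roberta-base-amharic"
MASK_TOKEN = "<mask>"  # assumed to match the tokenizer's mask token


def fill_mask_direct(text: str, top_k: int = 5, model_id: str = MODEL_ID):
    """Return the top_k (token, probability) guesses for the single mask in text."""
    # Validate cheaply before downloading or loading anything.
    if text.count(MASK_TOKEN) != 1:
        raise ValueError(f"expected exactly one {MASK_TOKEN} in the input")

    # Imported here so the validation above works without the heavy dependencies.
    import torch
    from transformers import AutoModelForMaskedLM, AutoTokenizer

    tokenizer = AutoTokenizer.from_pretrained(model_id)
    model = AutoModelForMaskedLM.from_pretrained(model_id)
    model.eval()

    inputs = tokenizer(text, return_tensors="pt")
    # Index of the mask token within the encoded sequence.
    is_mask = inputs["input_ids"][0] == tokenizer.mask_token_id
    mask_pos = is_mask.nonzero(as_tuple=True)[0][0]

    with torch.no_grad():
        logits = model(**inputs).logits  # shape: (1, seq_len, vocab_size)

    # Turn the logits at the mask position into probabilities and keep the best k.
    probs = torch.softmax(logits[0, mask_pos], dim=-1)
    top = torch.topk(probs, top_k)
    return [
        (tokenizer.decode([token_id.item()]).strip(), prob.item())
        for prob, token_id in zip(top.values, top.indices)
    ]
```

Calling `fill_mask_direct("ከሀገራቸው ከኢትዮጵያ ከወጡ ግማሽ ምዕተ <mask> ተቆጥሯል።")` should rank candidates much like the pipeline output above, although the decoded token strings may differ slightly in whitespace handling.
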
# Finetuning

This model was finetuned and evaluated on the following Amharic NLP tasks:

- **Sentiment Classification**
  - Dataset: [amharic-sentiment](https://huggingface.co/datasets/rasyosef/amharic-sentiment)
  - Code: https://github.com/rasyosef/amharic-sentiment-classification
- **Named Entity Recognition**
  - Dataset: [amharic-named-entity-recognition](https://huggingface.co/datasets/rasyosef/amharic-named-entity-recognition)
  - Code: https://github.com/rasyosef/amharic-named-entity-recognition

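
The linked repositories contain the actual finetuning code. As a rough illustration of the starting point for a task like sentiment classification, the sketch below loads the pretrained encoder with a freshly initialised classification head. The helper names and the two-class label set are hypothetical; check the dataset card for the real class names.

```python
MODEL_ID = "rasyosef/roberta-base-amharic"


def build_label_maps(labels):
    """Map class names to integer ids and back, in the order given."""
    label2id = {label: i for i, label in enumerate(labels)}
    id2label = {i: label for label, i in label2id.items()}
    return label2id, id2label


def load_for_sequence_classification(labels, model_id: str = MODEL_ID):
    """Load the pretrained encoder with a new, untrained classification head."""
    # Imported lazily so build_label_maps works without transformers installed.
    from transformers import AutoModelForSequenceClassification, AutoTokenizer

    label2id, id2label = build_label_maps(labels)
    tokenizer = AutoTokenizer.from_pretrained(model_id)
    model = AutoModelForSequenceClassification.from_pretrained(
        model_id,
        num_labels=len(labels),
        id2label=id2label,
        label2id=label2id,
    )
    return tokenizer, model


# Hypothetical label set, used only to show the mapping.
label2id, id2label = build_label_maps(["negative", "positive"])
print(label2id)
```

For Named Entity Recognition the same pattern would apply with `AutoModelForTokenClassification` and the tag set of the NER dataset. The classification head starts untrained in both cases, so the model needs finetuning before its predictions are meaningful.
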
### Finetuned Model Performance

The reported F1 scores are macro averages.

|Model|Size (# params)|Perplexity|Sentiment (F1)|Named Entity Recognition (F1)|
|-----|---------------|----------|--------------|-----------------------------|
|**roberta-base-amharic**|**110M**|**8.08**|**0.88**|**0.78**|
|roberta-medium-amharic|42.2M|11.59|0.84|0.75|
|bert-medium-amharic|40.5M|13.74|0.83|0.68|
|bert-small-amharic|27.8M|15.96|0.83|0.68|
|bert-mini-amharic|10.7M|22.42|0.81|0.64|
|bert-tiny-amharic|4.18M|71.52|0.79|0.54|
|xlm-roberta-base|279M||0.83|0.73|
|afro-xlmr-base|278M||0.83|0.75|
|afro-xlmr-large|560M||0.86|0.76|
|am-roberta|443M||0.82|0.69|
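
Because the F1 scores above are macro averages, each class contributes equally regardless of how often it occurs. The plain-Python sketch below shows that computation on made-up toy labels. It is not the evaluation code behind the table, and the function name `macro_f1` is illustrative.

```python
def macro_f1(y_true, y_pred):
    """Unweighted mean of per-class F1 scores."""
    classes = sorted(set(y_true) | set(y_pred))
    per_class = []
    for cls in classes:
        tp = sum(1 for t, p in zip(y_true, y_pred) if t == cls and p == cls)
        fp = sum(1 for t, p in zip(y_true, y_pred) if t != cls and p == cls)
        fn = sum(1 for t, p in zip(y_true, y_pred) if t == cls and p != cls)
        denom = 2 * tp + fp + fn
        # F1 = 2TP / (2TP + FP + FN); defined as 0 when the class never appears.
        per_class.append(2 * tp / denom if denom else 0.0)
    return sum(per_class) / len(per_class)


# Toy example with invented labels.
y_true = ["pos", "pos", "neg", "neg"]
y_pred = ["pos", "neg", "neg", "neg"]
print(f"macro F1 = {macro_f1(y_true, y_pred):.3f}")
```

In the toy example the per-class F1 scores are 2/3 for `pos` and 0.8 for `neg`, so the macro average is their plain mean, about 0.733.
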