	---
	library_name: transformers
	datasets:
	- oscar
	- mc4
	- rasyosef/amharic-sentences-corpus
	language:
	- am
	metrics:
	- perplexity
	pipeline_tag: fill-mask
	widget:
	- text: ከሀገራቸው ከኢትዮጵያ ከወጡ ግማሽ ምዕተ <mask> ተቆጥሯል።
	  example_title: Example 1
	- text: ባለፉት አምስት ዓመታት የአውሮጳ ሀገራት የጦር <mask> ግዢ በእጅጉ ጨምሯል።
	  example_title: Example 2
	- text: ኬንያውያን ከዳር እስከዳር በአንድ ቆመው የተቃውሞ ድምጻቸውን ማሰማታቸውን ተከትሎ የዜጎችን ቁጣ የቀሰቀሰው የቀረጥ ጭማሪ ሕግ ትናንት በፕሬዝደንት ዊልያም ሩቶ <mask> ቢደረግም ዛሬም ግን የተቃውሞው እንቅስቃሴ መቀጠሉ እየተነገረ ነው።
	  example_title: Example 3
	- text: ተማሪዎቹ በውድድሩ ካሸነፉበት የፈጠራ ስራ መካከል <mask> እና ቅዝቃዜን እንደአየር ሁኔታው የሚያስተካክል ጃኬት አንዱ ነው።
	  example_title: Example 4
	---

	# roberta-base-amharic

	This model has the same architecture as [xlm-roberta-base](https://huggingface.co/FacebookAI/xlm-roberta-base) and was pretrained from scratch on the Amharic subsets of the [oscar](https://huggingface.co/datasets/oscar), [mc4](https://huggingface.co/datasets/mc4), and [amharic-sentences-corpus](https://huggingface.co/datasets/rasyosef/amharic-sentences-corpus) datasets, a total of 290 million tokens. The tokenizer was trained from scratch on the same text corpus and has a vocabulary size of 32k.

	The model was trained for 22 hours on an A100 40GB GPU.

	It achieves the following results on the evaluation set:

	- `Loss: 2.09`
	- `Perplexity: 8.08`
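The two numbers above are related: perplexity is conventionally the exponential of the mean cross-entropy loss. The sketch below checks that relationship, assuming the reported loss is a per-token average in natural-log units (the model card does not state this explicitly).

```python
import math

# Evaluation loss reported above; assumed to be the mean
# cross-entropy over masked tokens, measured in nats.
eval_loss = 2.09

# Perplexity is the exponential of the mean cross-entropy.
perplexity = math.exp(eval_loss)
print(f"perplexity = {perplexity:.2f}")

# Going the other way recovers the loss from the perplexity.
recovered_loss = math.log(8.08)
print(f"loss = {recovered_loss:.2f}")
```

Under that assumption the computed value agrees with the reported perplexity to two decimal places.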

	This model has 110 million parameters. On the Amharic sentiment classification and named entity recognition tasks reported below, it achieves the best scores among the models compared, beating the roughly 2.5x larger, 279 million parameter [xlm-roberta-base](https://huggingface.co/FacebookAI/xlm-roberta-base) multilingual model.

	# How to use
	You can use this model directly with a pipeline for masked language modeling:

	```python
	>>> from transformers import pipeline
	>>> unmasker = pipeline('fill-mask', model='rasyosef/roberta-base-amharic')
	>>> unmasker("ከሀገራቸው ከኢትዮጵያ ከወጡ ግማሽ ምዕተ <mask> ተቆጥሯል።")

	[{'score': 0.40162667632102966,
	'token': 137,
	'token_str': 'ዓመት',
	'sequence': 'ከሀገራቸው ከኢትዮጵያ ከወጡ ግማሽ ምዕተ ዓመት ተቆጥሯል።'},
	{'score': 0.24096301198005676,
	'token': 346,
	'token_str': 'አመት',
	'sequence': 'ከሀገራቸው ከኢትዮጵያ ከወጡ ግማሽ ምዕተ አመት ተቆጥሯል።'},
	{'score': 0.15971705317497253,
	'token': 217,
	'token_str': 'ዓመታት',
	'sequence': 'ከሀገራቸው ከኢትዮጵያ ከወጡ ግማሽ ምዕተ ዓመታት ተቆጥሯል።'},
	{'score': 0.13074122369289398,
	'token': 733,
	'token_str': 'አመታት',
	'sequence': 'ከሀገራቸው ከኢትዮጵያ ከወጡ ግማሽ ምዕተ አመታት ተቆጥሯል።'},
	{'score': 0.03847867250442505,
	'token': 194,
	'token_str': 'ዘመን',
	'sequence': 'ከሀገራቸው ከኢትዮጵያ ከወጡ ግማሽ ምዕተ ዘመን ተቆጥሯል።'}]
	```
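The pipeline returns a list of dictionaries, one per candidate token, sorted by score. As a minimal sketch of post-processing that output, the snippet below picks the top candidate and keeps only candidates above a score threshold. The `results` list is copied from the output shown above (trimmed to two keys), and the helper name `confident_fills` and the `0.1` threshold are illustrative choices, not part of the library.

```python
# Candidate fills copied from the pipeline output above
# (only the 'score' and 'token_str' keys are kept here).
results = [
    {"score": 0.40162667632102966, "token_str": "ዓመት"},
    {"score": 0.24096301198005676, "token_str": "አመት"},
    {"score": 0.15971705317497253, "token_str": "ዓመታት"},
    {"score": 0.13074122369289398, "token_str": "አመታት"},
    {"score": 0.03847867250442505, "token_str": "ዘመን"},
]


def confident_fills(candidates, min_score=0.1):
    """Return the token strings whose score is at least min_score.

    Hypothetical helper for illustration; the threshold is arbitrary.
    """
    return [c["token_str"] for c in candidates if c["score"] >= min_score]


# The highest-scoring candidate is the model's preferred fill.
best = max(results, key=lambda c: c["score"])
print(best["token_str"])

# Candidates that clear the threshold.
print(confident_fills(results))
```

With these scores, four of the five candidates clear the `0.1` threshold, and the top two are spelling variants of the same word.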

	# Finetuning

	This model was finetuned and evaluated on the following Amharic NLP tasks:

	- Sentiment Classification
	  - Dataset: [amharic-sentiment](https://huggingface.co/datasets/rasyosef/amharic-sentiment)
	  - Code: https://github.com/rasyosef/amharic-sentiment-classification
	- Named Entity Recognition
	  - Dataset: [amharic-named-entity-recognition](https://huggingface.co/datasets/rasyosef/amharic-named-entity-recognition)
	  - Code: https://github.com/rasyosef/amharic-named-entity-recognition

	### Finetuned Model Performance
	The reported F1 scores are macro averages.

	|Model|Size (# params)|Perplexity|Sentiment (F1)|Named Entity Recognition (F1)|
	|-----|---------------|----------|--------------|-----------------------------|
	|roberta-base-amharic|110M|8.08|0.88|0.78|
	|roberta-medium-amharic|42.2M|11.59|0.84|0.75|
	|bert-medium-amharic|40.5M|13.74|0.83|0.68|
	|bert-small-amharic|27.8M|15.96|0.83|0.68|
	|bert-mini-amharic|10.7M|22.42|0.81|0.64|
	|bert-tiny-amharic|4.18M|71.52|0.79|0.54|
	|xlm-roberta-base|279M||0.83|0.73|
	|afro-xlmr-base|278M||0.83|0.75|
	|afro-xlmr-large|560M||0.86|0.76|
	|am-roberta|443M||0.82|0.69|
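For readers unfamiliar with the metric, a macro-averaged F1 computes F1 separately for each class and then takes the unweighted mean, so rare classes count as much as frequent ones. The following is a minimal pure-Python sketch of that calculation. The labels are invented toy data, and the linked evaluation code may compute the scores differently (for example, at the entity level for NER).

```python
def macro_f1(y_true, y_pred):
    """Unweighted mean of per-class F1 scores."""
    labels = sorted(set(y_true) | set(y_pred))
    per_class = []
    for label in labels:
        pairs = list(zip(y_true, y_pred))
        tp = sum(1 for t, p in pairs if t == label and p == label)
        fp = sum(1 for t, p in pairs if t != label and p == label)
        fn = sum(1 for t, p in pairs if t == label and p != label)
        precision = tp / (tp + fp) if tp + fp else 0.0
        recall = tp / (tp + fn) if tp + fn else 0.0
        denom = precision + recall
        per_class.append(2 * precision * recall / denom if denom else 0.0)
    return sum(per_class) / len(per_class)


# Toy sentiment labels, invented for illustration only.
y_true = ["pos", "pos", "neg", "neg"]
y_pred = ["pos", "neg", "neg", "neg"]

# Per class: F1(pos) = 2/3 and F1(neg) = 4/5, so the macro average is 11/15.
score = macro_f1(y_true, y_pred)
print(f"macro F1 = {score:.4f}")
```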