	---
	tags:
	- summarization
	- mT5
	datasets:
	- csebuetnlp/xlsum
	language:
	- am
	- ar
	- az
	- bn
	- my
	- zh
	- en
	- fr
	- gu
	- ha
	- hi
	- ig
	- id
	- ja
	- rn
	- ko
	- ky
	- mr
	- ne
	- om
	- ps
	- fa
	- pcm
	- pt
	- pa
	- ru
	- gd
	- sr
	- si
	- so
	- es
	- sw
	- ta
	- te
	- th
	- ti
	- tr
	- uk
	- ur
	- uz
	- vi
	- cy
	- yo
	license: cc-by-nc-sa-4.0
	widget:
	- text: "Yahoo's patents suggest users could weigh the type of ads against the sizes of discount before purchase. It says in two US patent applications that ads for digital book readers have been \"less than optimal\" to date. The filings suggest that users could be offered titles at a variety of prices depending on the ads' prominence They add that the products shown could be determined by the type of book being read, or even the contents of a specific chapter, phrase or word. The paperwork was published by the US Patent and Trademark Office late last week and relates to work carried out at the firm's headquarters in Sunnyvale, California. \"Greater levels of advertising, which may be more valuable to an advertiser and potentially more distracting to an e-book reader, may warrant higher discounts,\" it states. Free books It suggests users could be offered ads as hyperlinks based within the book's text, in-laid text or even \"dynamic content\" such as video. Another idea suggests boxes at the bottom of a page could trail later chapters or quotes saying \"brought to you by Company A\". It adds that the more willing the customer is to see the ads, the greater the potential discount. \"Higher frequencies... may even be great enough to allow the e-book to be obtained for free,\" it states. The authors write that the type of ad could influence the value of the discount, with \"lower class advertising... such as teeth whitener advertisements\" offering a cheaper price than \"high\" or \"middle class\" adverts, for things like pizza. The inventors also suggest that ads could be linked to the mood or emotional state the reader is in as a they progress through a title. For example, they say if characters fall in love or show affection during a chapter, then ads for flowers or entertainment could be triggered. The patents also suggest this could applied to children's books - giving the Tom Hanks animated film Polar Express as an example. 
It says a scene showing a waiter giving the protagonists hot drinks \"may be an excellent opportunity to show an advertisement for hot cocoa, or a branded chocolate bar\". Another example states: \"If the setting includes young characters, a Coke advertisement could be provided, inviting the reader to enjoy a glass of Coke with his book, and providing a graphic of a cool glass.\" It adds that such targeting could be further enhanced by taking account of previous titles the owner has bought. 'Advertising-free zone' At present, several Amazon and Kobo e-book readers offer full-screen adverts when the device is switched off and show smaller ads on their menu screens, but the main text of the titles remains free of marketing. Yahoo does not currently provide ads to these devices, and a move into the area could boost its shrinking revenues. However, Philip Jones, deputy editor of the Bookseller magazine, said that the internet firm might struggle to get some of its ideas adopted. \"This has been mooted before and was fairly well decried,\" he said. \"Perhaps in a limited context it could work if the merchandise was strongly related to the title and was kept away from the text. \"But readers - particularly parents - like the fact that reading is an advertising-free zone. Authors would also want something to say about ads interrupting their narrative flow.\""

	---

	# mT5-multilingual-XLSum

	This repository contains the mT5 checkpoint finetuned on the 45 languages of the [XL-Sum](https://huggingface.co/datasets/csebuetnlp/xlsum) dataset. For finetuning details and scripts,
	see the [paper](https://aclanthology.org/2021.findings-acl.413/) and the [official repository](https://github.com/csebuetnlp/xl-sum).


	## Using this model in `transformers` (tested on 4.11.0.dev0)

	```python
	from transformers import AutoTokenizer, AutoModelForSeq2SeqLM

	article_text = """Input article text"""

	model_name = "csebuetnlp/mT5_multilingual_XLSum"
	tokenizer = AutoTokenizer.from_pretrained(model_name)
	model = AutoModelForSeq2SeqLM.from_pretrained(model_name)

	# Flatten newlines, then tokenize; the input is padded or truncated to 512 tokens.
	input_ids = tokenizer(
	    [article_text.replace("\n", " ").strip()],
	    return_tensors="pt",
	    padding="max_length",
	    truncation=True,
	    max_length=512
	)["input_ids"]

	# Beam search with 4 beams; summaries are capped at 84 tokens.
	output_ids = model.generate(
	    input_ids=input_ids,
	    max_length=84,
	    no_repeat_ngram_size=2,
	    num_beams=4
	)[0]

	# Convert the generated token ids back to text.
	summary = tokenizer.decode(
	    output_ids,
	    skip_special_tokens=True,
	    clean_up_tokenization_spaces=False
	)
	print(summary)
	```
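
The snippet above only replaces newlines before tokenizing. Text copied from web pages often also contains tabs and runs of spaces, so a slightly more thorough cleanup can help. The helper below is a minimal sketch; `normalize_whitespace` is a hypothetical name, not part of this repository or of `transformers`.

```python
import re


def normalize_whitespace(text: str) -> str:
    """Collapse every run of whitespace (newlines, tabs, spaces) into one space."""
    # Hypothetical helper for illustration; not part of the model's own code.
    return re.sub(r"\s+", " ", text).strip()


raw_article = "  First line.\n\nSecond\tline.  "
print(normalize_whitespace(raw_article))
```

The result can be passed to the tokenizer in place of `article_text.replace("\n", " ").strip()`.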

	## Benchmarks

	ROUGE scores on the XL-Sum test sets are given below, one row per language.

	Language | ROUGE-1 / ROUGE-2 / ROUGE-L
	---------|----------------------------
	Amharic | 20.0485 / 7.4111 / 18.0753
	Arabic | 34.9107 / 14.7937 / 29.1623
	Azerbaijani | 21.4227 / 9.5214 / 19.3331
	Bengali | 29.5653 / 12.1095 / 25.1315
	Burmese | 15.9626 / 5.1477 / 14.1819
	Chinese (Simplified) | 39.4071 / 17.7913 / 33.406
	Chinese (Traditional) | 37.1866 / 17.1432 / 31.6184
	English | 37.601 / 15.1536 / 29.8817
	French | 35.3398 / 16.1739 / 28.2041
	Gujarati | 21.9619 / 7.7417 / 19.86
	Hausa | 39.4375 / 17.6786 / 31.6667
	Hindi | 38.5882 / 16.8802 / 32.0132
	Igbo | 31.6148 / 10.1605 / 24.5309
	Indonesian | 37.0049 / 17.0181 / 30.7561
	Japanese | 48.1544 / 23.8482 / 37.3636
	Kirundi | 31.9907 / 14.3685 / 25.8305
	Korean | 23.6745 / 11.4478 / 22.3619
	Kyrgyz | 18.3751 / 7.9608 / 16.5033
	Marathi | 22.0141 / 9.5439 / 19.9208
	Nepali | 26.6547 / 10.2479 / 24.2847
	Oromo | 18.7025 / 6.1694 / 16.1862
	Pashto | 38.4743 / 15.5475 / 31.9065
	Persian | 36.9425 / 16.1934 / 30.0701
	Pidgin | 37.9574 / 15.1234 / 29.872
	Portuguese | 37.1676 / 15.9022 / 28.5586
	Punjabi | 30.6973 / 12.2058 / 25.515
	Russian | 32.2164 / 13.6386 / 26.1689
	Scottish Gaelic | 29.0231 / 10.9893 / 22.8814
	Serbian (Cyrillic) | 23.7841 / 7.9816 / 20.1379
	Serbian (Latin) | 21.6443 / 6.6573 / 18.2336
	Sinhala | 27.2901 / 13.3815 / 23.4699
	Somali | 31.5563 / 11.5818 / 24.2232
	Spanish | 31.5071 / 11.8767 / 24.0746
	Swahili | 37.6673 / 17.8534 / 30.9146
	Tamil | 24.3326 / 11.0553 / 22.0741
	Telugu | 19.8571 / 7.0337 / 17.6101
	Thai | 37.3951 / 17.275 / 28.8796
	Tigrinya | 25.321 / 8.0157 / 21.1729
	Turkish | 32.9304 / 15.5709 / 29.2622
	Ukrainian | 23.9908 / 10.1431 / 20.9199
	Urdu | 39.5579 / 18.3733 / 32.8442
	Uzbek | 16.8281 / 6.3406 / 15.4055
	Vietnamese | 32.8826 / 16.2247 / 26.0844
	Welsh | 32.6599 / 11.596 / 26.1164
	Yoruba | 31.6595 / 11.6599 / 25.0898
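
For readers unfamiliar with the metric, the sketch below shows what a ROUGE-N F1 score measures: the overlap of word n-grams between a reference summary and a generated one. It is an illustration only, not the scoring code behind the table. It assumes whitespace tokenization (which does not suit languages such as Chinese, Japanese, or Thai), omits ROUGE-L, and returns values in the range 0 to 1, whereas the table values appear to be on a 0 to 100 scale. `rouge_n_f1` is a hypothetical name; see the official repository for the evaluation setup actually used.

```python
from collections import Counter


def rouge_n_f1(reference: str, candidate: str, n: int = 1) -> float:
    """F1 overlap of word n-grams between a reference and a candidate summary."""

    def ngrams(text: str) -> Counter:
        # Assumption: lowercase whitespace tokens stand in for real tokenization.
        tokens = text.lower().split()
        return Counter(tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1))

    ref, cand = ngrams(reference), ngrams(candidate)
    # Counter intersection keeps the minimum count of each shared n-gram.
    overlap = sum((ref & cand).values())
    if overlap == 0:
        return 0.0
    precision = overlap / sum(cand.values())
    recall = overlap / sum(ref.values())
    return 2 * precision * recall / (precision + recall)


reference = "yahoo files patents for ads inside e-books"
candidate = "yahoo patents describe ads inside e-books"
print(rouge_n_f1(reference, candidate, n=1))
print(rouge_n_f1(reference, candidate, n=2))
```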



	## Citation

	If you use this model, please cite the following paper:
	```
	@inproceedings{hasan-etal-2021-xl,
	    title = "{XL}-Sum: Large-Scale Multilingual Abstractive Summarization for 44 Languages",
	    author = "Hasan, Tahmid and
	      Bhattacharjee, Abhik and
	      Islam, Md. Saiful and
	      Mubasshir, Kazi and
	      Li, Yuan-Fang and
	      Kang, Yong-Bin and
	      Rahman, M. Sohel and
	      Shahriyar, Rifat",
	    booktitle = "Findings of the Association for Computational Linguistics: ACL-IJCNLP 2021",
	    month = aug,
	    year = "2021",
	    address = "Online",
	    publisher = "Association for Computational Linguistics",
	    url = "https://aclanthology.org/2021.findings-acl.413",
	    pages = "4693--4703",
	}
	```