Update README.md
README.md CHANGED
@@ -25,6 +25,10 @@ Hebrew text generation model based on [EleutherAI's gpt-neo](https://github.com/

 The Open Super-large Crawled ALMAnaCH coRpus is a huge multilingual corpus obtained by language classification and filtering of the Common Crawl corpus using the goclassy architecture.

+3. CC100-Hebrew Dataset [Homepage](https://metatext.io/datasets/cc100-hebrew)
+
+Created by Conneau & Wenzek et al. in 2020, CC100-Hebrew is one of the 100 monolingual corpora processed from the January-December 2018 Common Crawl snapshots via the CC-Net repository. The size of the Hebrew corpus is 6.1 GB.
+
 ## Training Config

 Available [here](https://github.com/Norod/hebrew-gpt_neo/tree/main/hebrew-gpt_neo-xl/configs) <BR>

@@ -40,7 +44,7 @@ Available [here ](https://colab.research.google.com/github/Norod/hebrew-gpt_neo/

 ```python

-!pip install tokenizers==0.10.
+!pip install tokenizers==0.10.3 transformers==4.8.0

 from transformers import AutoTokenizer, AutoModelForCausalLM

@@ -87,7 +91,10 @@ if input_ids != None:
 print("Updated max_len = " + str(max_len))

 stop_token = "<|endoftext|>"
-new_lines = "\
+new_lines = "\n\n\n"

 sample_outputs = model.generate(
     input_ids,

@@ -98,7 +105,9 @@ sample_outputs = model.generate(
     num_return_sequences=sample_output_num
 )

-print(100 * '-' + "\
+print(100 * '-' + "\n\t\tOutput\n" + 100 * '-')
 for i, sample_output in enumerate(sample_outputs):

   text = tokenizer.decode(sample_output, skip_special_tokens=True)

@@ -109,7 +118,9 @@ for i, sample_output in enumerate(sample_outputs):
   # Remove all text after 3 newlines
   text = text[: text.find(new_lines) if new_lines else None]

-print("\
-
+  print("\n{}: {}".format(i, text))
+  print("\n" + 100 * '-')

 ```
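The hunks above show the updated usage snippet only in fragments, so here is a minimal end-to-end sketch of how the pieces fit together after this commit. The checkpoint id `Norod78/hebrew-gpt_neo-xl`, the Hebrew prompt, and the sampling settings (`do_sample`, `top_k`, `top_p`) are assumptions that do not appear in the diff, and the triple-newline truncation is written with an explicit `-1` guard that the one-line version in the README lacks:

```python
# A sketch of the README's generation flow, not the verbatim file.
# ASSUMPTIONS: the Hugging Face id below is inferred from the repo name,
# and the sampling settings are typical values hidden by the diff context.
from transformers import AutoTokenizer, AutoModelForCausalLM

model_id = "Norod78/hebrew-gpt_neo-xl"  # assumed checkpoint id
tokenizer = AutoTokenizer.from_pretrained(model_id)
model = AutoModelForCausalLM.from_pretrained(model_id, pad_token_id=tokenizer.eos_token_id)

prompt_text = "שלום, קוראים לי"  # "Hello, my name is"
max_len = 50
sample_output_num = 3

# Grow the generation budget so it covers the prompt plus the continuation.
input_ids = tokenizer.encode(prompt_text, return_tensors="pt")
if input_ids is not None:
    max_len += len(input_ids[0])
    print("Updated max_len = " + str(max_len))

new_lines = "\n\n\n"  # output is cut at the first run of three newlines

sample_outputs = model.generate(
    input_ids,
    do_sample=True,  # assumed: sampling settings are not shown in the hunks
    max_length=max_len,
    top_k=50,
    top_p=0.95,
    num_return_sequences=sample_output_num,
)

print(100 * '-' + "\n\t\tOutput\n" + 100 * '-')
for i, sample_output in enumerate(sample_outputs):
    # skip_special_tokens=True already strips the <|endoftext|> stop token.
    text = tokenizer.decode(sample_output, skip_special_tokens=True)
    # Remove all text after 3 newlines; unlike the README's one-liner,
    # guard against str.find returning -1 when no triple newline occurs.
    cut = text.find(new_lines)
    if cut != -1:
        text = text[:cut]
    print("\n{}: {}".format(i, text))
    print("\n" + 100 * '-')
```

Decoding with `skip_special_tokens=True` already removes `<|endoftext|>`, which is why the sketch only needs the newline cutoff.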