Update README.md
README.md CHANGED
@@ -25,6 +25,10 @@ Hebrew text generation model based on [EleutherAI's gpt-neo](https://github.com/

 The Open Super-large Crawled ALMAnaCH coRpus is a huge multilingual corpus obtained by language classification and filtering of the Common Crawl corpus using the goclassy architecture.

+3. CC100-Hebrew Dataset [Homepage](https://metatext.io/datasets/cc100-hebrew)
+
+Created by Conneau & Wenzek et al. in 2020, CC100-Hebrew is one of the 100 monolingual corpora processed from the January-December 2018 Common Crawl snapshots via the CC-Net repository. The size of the Hebrew corpus is 6.1 GB.
+
 ## Training Config

 Available [here](https://github.com/Norod/hebrew-gpt_neo/tree/main/hebrew-gpt_neo-xl/configs) <BR>

@@ -40,7 +44,7 @@ Available [here ](https://colab.research.google.com/github/Norod/hebrew-gpt_neo/

 ```python

-!pip install tokenizers==0.10.
+!pip install tokenizers==0.10.3 transformers==4.8.0

 from transformers import AutoTokenizer, AutoModelForCausalLM

@@ -87,7 +91,10 @@ if input_ids != None:
 print("Updated max_len = " + str(max_len))

 stop_token = "<|endoftext|>"
-new_lines = "\
+new_lines = "\n\n\n"

 sample_outputs = model.generate(
     input_ids,

@@ -98,7 +105,9 @@ sample_outputs = model.generate(
     num_return_sequences=sample_output_num
 )

-print(100 * '-' + "\
+print(100 * '-' + "\n\t\tOutput\n" + 100 * '-')
 for i, sample_output in enumerate(sample_outputs):

   text = tokenizer.decode(sample_output, skip_special_tokens=True)

@@ -109,7 +118,9 @@ for i, sample_output in enumerate(sample_outputs):
   # Remove all text after 3 newlines
   text = text[: text.find(new_lines) if new_lines else None]

-print("\
-
+  print("\n{}: {}".format(i, text))
+  print("\n" + 100 * '-')

 ```
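The hunks above show the updated usage snippet only in fragments, so here is a minimal end-to-end sketch of how the pieces fit together after this commit. The checkpoint id `Norod78/hebrew-gpt_neo-xl`, the Hebrew prompt, and the sampling settings (`do_sample`, `top_k`, `top_p`) are assumptions that do not appear in the diff, and the triple-newline truncation is written with an explicit `-1` guard that the one-line version in the README lacks:

```python
# A sketch of the README's generation flow, not the verbatim file.
# ASSUMPTIONS: the Hugging Face id below is inferred from the repo name,
# and the sampling settings are typical values hidden by the diff context.
from transformers import AutoTokenizer, AutoModelForCausalLM

model_id = "Norod78/hebrew-gpt_neo-xl"  # assumed checkpoint id
tokenizer = AutoTokenizer.from_pretrained(model_id)
model = AutoModelForCausalLM.from_pretrained(model_id, pad_token_id=tokenizer.eos_token_id)

prompt_text = "שלום, קוראים לי"  # "Hello, my name is"
max_len = 50
sample_output_num = 3

# Grow the generation budget so it covers the prompt plus the continuation.
input_ids = tokenizer.encode(prompt_text, return_tensors="pt")
if input_ids is not None:
    max_len += len(input_ids[0])
    print("Updated max_len = " + str(max_len))

new_lines = "\n\n\n"  # output is cut at the first run of three newlines

sample_outputs = model.generate(
    input_ids,
    do_sample=True,  # assumed: sampling settings are not shown in the hunks
    max_length=max_len,
    top_k=50,
    top_p=0.95,
    num_return_sequences=sample_output_num,
)

print(100 * '-' + "\n\t\tOutput\n" + 100 * '-')
for i, sample_output in enumerate(sample_outputs):
    # skip_special_tokens=True already strips the <|endoftext|> stop token.
    text = tokenizer.decode(sample_output, skip_special_tokens=True)
    # Remove all text after 3 newlines; unlike the README's one-liner,
    # guard against str.find returning -1 when no triple newline occurs.
    cut = text.find(new_lines)
    if cut != -1:
        text = text[:cut]
    print("\n{}: {}".format(i, text))
    print("\n" + 100 * '-')
```

Decoding with `skip_special_tokens=True` already removes `<|endoftext|>`, which is why the sketch only needs the newline cutoff.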