Books from the Survivor Library (mostly ~1920s & earlier) OCR'd with recent VLMs
BEEspoke Data
community
AI & ML interests
'an LLM is only as good as the dataset it was trained on' - Sun Tzu
smol_llama 220M fine-tunes we did
- BEE-spoke-data/smol_llama-220M-openhermes
  Text Generation • 0.2B • Updated • 1.07k • 5
- BEE-spoke-data/smol_llama-220M-open_instruct
  Text Generation • 0.2B • Updated • 20 • 2
- BEE-spoke-data/beecoder-220M-python
  Text Generation • 0.2B • Updated • 29 • 3
- BEE-spoke-data/zephyr-220m-sft-full
  Text Generation • 0.2B • Updated • 1k • 1
models fine-tuned to be knowledgeable about apiary practice
- BEE-spoke-data/TinyLlama-3T-1.1bee
  Text Generation • 1B • Updated • 29 • 2
- BEE-spoke-data/TinyLlama-1.1bee
  Text Generation • 1B • Updated • 15 • 1
- BEE-spoke-data/Meta-Llama-3-8Bee
  Text Generation • 8B • Updated • 28
- BEE-spoke-data/phi-1bee5
  Text Generation • 1B • Updated • 11 • 1
trained and adapted tokenizers - various
🚧 "raw" pretrained smol_llama checkpoints - WIP 🚧
- BEE-spoke-data/smol_llama-101M-GQA
  Text Generation • 0.1B • Updated • 4.07k • 30
- BEE-spoke-data/smol_llama-81M-tied
  Text Generation • 81.3M • Updated • 1.08k • 9
- BEE-spoke-data/smol_llama-220M-GQA
  Text Generation • 0.2B • Updated • 3.04k • 13
- BEE-spoke-data/verysmol_llama-v11-KIx2
  Text Generation • 58.1M • Updated • 1.07k • 4
Pretrained encoder (fill-mask) models we made
text classification models for book genres
- BEE-spoke-data/albert-xxlarge-v2-description2genre
  Text Classification • 0.2B • Updated • 15 • 2
- BEE-spoke-data/mobilebert-uncased-title2genre
  Text Classification • 24.6M • Updated • 27 • 1
- BEE-spoke-data/roberta-large-title2genre
  Text Classification • 0.4B • Updated • 11 • 1
- BEE-spoke-data/roberta-base-description2genre
  Text Classification • 0.1B • Updated • 17
concept datasets extracted from fineweb