MixtureVitae: Open Web-Scale Pretraining Dataset With High Quality Instruction and Reasoning Data Built from Permissive-First Text Sources
Paper: arXiv 2509.25531
'an LLM is only as good as the dataset it was trained on' - Sun Tzu