AI & ML interests

tokenization, CHILDES, word segmentation, phonemes, BabyLM

phonemetransformers's collections

BabyLM's First Words
Models trained on IPA-CHILDES and evaluated for phonological knowledge using word segmentation, a task linked to child language acquisition.
IPA CHILDES
The IPA-CHILDES dataset, along with the models and tokenizers used for phoneme-based language modeling across the 31 languages in CHILDES.
From Babble to Words
The models, tokenizers, and datasets used in "From Babble to Words", one of the winning BabyLM 2024 submissions, which explores phoneme-based training.