Quentin Lhoest PRO
lhoestq
AI & ML interests
Maintainer of the 🤗 Dataset Hub ecosystem: NLP and multimodal data loading, viewing, processing, and sharing
Recent Activity
liked
a dataset
about 23 hours ago
fhudson96/TAPVid360-10k
liked
a dataset
1 day ago
hugging-science/arc-aphasia-bids
liked
a dataset
1 day ago
yupp-ai/yupp-svg-20251204
reacted to
KaraKaraWitch's
post with 🔥
7 months ago
reacted to
ajibawa-2023's
post with 🔥
8 months ago
Post
4532
Hi all, I recently released two audio datasets generated from my earlier dataset:
ajibawa-2023/Children-Stories-Collection
First audio dataset: https://huggingface.co/datasets/ajibawa-2023/Audio-Children-Stories-Collection-Large has 5,600+ stories in .mp3 format.
Second audio dataset: https://huggingface.co/datasets/ajibawa-2023/Audio-Children-Stories-Collection has 600 stories in .mp3 format.
reacted to
MikeDoes's
post
9 months ago
Post
2115
#PII Masking Tech that does not **** around!
We are happy to release the OpenPII English Anonymiser: the most powerful open-source tool for redacting sensitive info from English text.
Fine-tuned from ModernBERT on 5.7 million+ PII examples, it clocks 99%+ accuracy across emails, dates, social numbers, and more!
Why it's a big deal:
✅ Top-tier precision: 100% for passport numbers, 99.96% for emails*.
✅ Totally free: MIT license for personal or commercial use.
✅ No secrets: full metrics shared on Hugging Face.
#AI #OpenSource #DataSecurity @huggingface
Day 2 of 7 PII-Masking-1M Announcements Complete!
*Accuracies reported from the new OpenPII-500k dataset
ai4privacy/llama-ai4privacy-english-anonymiser-openpii
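As a rough illustration of what this kind of PII redaction does, here is a toy regex-based sketch in Python. This is not the ai4privacy model (which is a fine-tuned ModernBERT token classifier); the patterns and placeholder labels below are illustrative assumptions only:

```python
import re

# Toy redaction patterns; a trained model like the OpenPII anonymiser
# handles many more entity types and context-dependent cases.
PATTERNS = {
    "[EMAIL]": re.compile(r"\b[\w.+-]+@[\w-]+\.[\w.-]+\b"),
    "[DATE]": re.compile(r"\b\d{4}-\d{2}-\d{2}\b"),
    "[SSN]": re.compile(r"\b\d{3}-\d{2}-\d{4}\b"),
}

def redact(text: str) -> str:
    """Replace each matched span with its placeholder label."""
    for label, pattern in PATTERNS.items():
        text = pattern.sub(label, text)
    return text

print(redact("Contact jane.doe@example.com by 2024-01-15, SSN 123-45-6789."))
# Contact [EMAIL] by [DATE], SSN [SSN].
```

A model-based anonymiser replaces these brittle regexes with learned token classification, which is how it reaches high precision on entities (like names or addresses) that regexes cannot capture reliably.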
reacted to
merve's
post with 🤗🔥
11 months ago
Post
2101
New smolagents example landed on the Hugging Face cookbook 🤗
Learn how to create an inventory-managing multi-agent system with smolagents, MongoDB, and DeepSeek Chat: https://huggingface.co/learn/cookbook/mongodb_smolagents_multi_micro_agents
reacted to
ariG23498's
post
11 months ago
Post
2075
Timm ❤️ Transformers
With the latest version of transformers you can now use any timm model with the familiar transformers API.
Blog Post: https://huggingface.co/blog/timm-transformers
Repository with examples: https://github.com/ariG23498/timm-wrapper-examples
Collection: ariG23498/timmwrapper-6777b85f1e8d085d3f1374a1
reacted to
singhsidhukuldeep's
post
11 months ago
Post
1152
Breaking News: LinkedIn's Content Search Engine Gets a Powerful Semantic Upgrade!
Excited to share insights about LinkedIn's innovative approach to content search, recently detailed in a groundbreaking paper by their Mountain View team. This advancement represents a significant shift from traditional keyword-based search to semantic understanding.
>> Technical Architecture
The new search engine employs a sophisticated two-layer architecture:
Retrieval Layer
- Token Based Retriever (TBR) for exact keyword matching
- Embedding Based Retriever (EBR) using a two-tower model with multilingual-e5 embeddings
- Pre-computed post embeddings stored in a dedicated embedding store for efficient retrieval
Multi-Stage Ranking
- L1 Stage: Initial filtering using a lightweight model
- L2 Stage: Advanced ranking with complex features including:
- Query-post semantic matching
- Author reputation analysis
- User engagement metrics
- Content freshness evaluation
>> Performance Improvements
The system has achieved remarkable results:
- 10%+ improvement in both on-topic rate and long-dwell metrics
- Enhanced ability to handle complex natural language queries
- Significant boost in sitewide engagement
This advancement enables LinkedIn to better serve complex queries like "how to ask for a raise?" while maintaining high performance at scale. The system intelligently balances between exact keyword matching and semantic understanding, ensuring optimal results for both navigational and conceptual searches.
What impresses me most is how the team solved the scale challenge - processing billions of posts efficiently using pre-computed embeddings and approximate nearest neighbor search. This is enterprise-scale AI at its finest.
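The retrieve-then-rank flow described in the post can be sketched in plain Python. This is a toy illustration with made-up embeddings, thresholds, and a trivial L1 filter, not LinkedIn's actual system:

```python
import math

def cosine(a, b):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    na = math.sqrt(sum(x * x for x in a))
    nb = math.sqrt(sum(x * x for x in b))
    return dot / (na * nb)

# Pre-computed post embeddings (the "embedding store"); values are toy data.
POSTS = {
    "post_raise":  ([0.9, 0.1, 0.2], "how to negotiate a salary increase"),
    "post_python": ([0.1, 0.8, 0.3], "python tips for data engineers"),
    "post_resume": ([0.7, 0.2, 0.6], "resume advice for career changers"),
}

def search(query_embedding, keywords, k=2):
    # Retrieval layer: union of token-based (TBR) and embedding-based (EBR)
    # candidates. A real EBR would use approximate nearest neighbor search.
    tbr = {pid for pid, (_, text) in POSTS.items()
           if any(kw in text for kw in keywords)}
    ebr = {pid for pid, (emb, _) in POSTS.items()
           if cosine(query_embedding, emb) > 0.5}
    candidates = tbr | ebr
    # L1 stage: cheap filter. L2 stage: rank survivors by semantic match
    # (a real L2 would also weigh author reputation, engagement, freshness).
    l1 = [pid for pid in candidates if len(POSTS[pid][1]) > 10]
    ranked = sorted(l1, key=lambda pid: cosine(query_embedding, POSTS[pid][0]),
                    reverse=True)
    return ranked[:k]

print(search([0.9, 0.1, 0.2], ["salary"]))
```

The key design point the paper describes survives even in this sketch: exact keyword matching and semantic similarity produce separate candidate sets that are merged before ranking, so both navigational and conceptual queries get served.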
posted
an
update
12 months ago
Post
2898
Made an HF dataset editor à la Google Sheets here:
lhoestq/dataset-spreadsheets
With Dataset Spreadsheets:
✏️ Edit datasets in the UI
Share link with collaborators
Use locally in DuckDB or Python
Available for the 100,000+ Parquet datasets on HF :)
reacted to
christopher's
post
12 months ago
Post
2443
The Lichess database of games, puzzles, and engine evaluations is now on the Hub:
Lichess
Billions of chess data points to download, query, and stream; we're excited to see what you'll build with it! ♟️ 🤗
- https://huggingface.co/collections/Lichess/positions-datasets-66f50837db5cd3287d60d489
- https://huggingface.co/collections/Lichess/games-datasets-66f508df78f4b43e1bb2d353
reacted to
christopher's
post with 🔥
12 months ago
Post
2105
The folks at Foursquare released a dataset of 104.5 million places of interest (foursquare/fsq-os-places), and here they all are on a plot
reacted to
dvilasuero's
post with ❤️🔥
about 1 year ago
Post
2790
Announcing Global-MMLU: an improved, open MMLU dataset with evaluation coverage across 42 languages, built with Argilla and the Hugging Face community.
Global-MMLU is the result of months of work with the goal of advancing Multilingual LLM evaluation. It's been an amazing open science effort with collaborators from Cohere For AI, Mila - Quebec Artificial Intelligence Institute, EPFL, Massachusetts Institute of Technology, AI Singapore, National University of Singapore, KAIST, Instituto Superior TΓ©cnico, Carnegie Mellon University, CONICET, and University of Buenos Aires.
🏷️ 200+ contributors used Argilla to label MMLU questions where regional, dialect, or cultural knowledge was required to answer correctly. 85% of the questions required Western-centric knowledge!
Thanks to this annotation process, the open dataset contains two subsets:
1. 🗽 Culturally Agnostic: no specific regional or cultural knowledge is required.
2. ⚖️ Culturally Sensitive: requires dialect, cultural, or geographic knowledge to answer correctly.
Moreover, we provide high-quality translations for 25 of the 42 languages, thanks again to the community and professional annotators leveraging Argilla on the Hub.
I hope this will ensure a better understanding of the limitations and challenges of making open AI useful for many languages.
Dataset: https://huggingface.co/datasets/CohereForAI/Global-MMLU
reacted to
davidberenstein1957's
post
about 1 year ago
Post
3530
The Data Is Better Together community is set to release the first Apache 2 licensed image preference dataset!
Great work and let's give this a final push :)
@aashish1904 congrats on your month of HF pro. There is more to win during this sprint!
@aashish1904 @AnyaDesdein @davidberenstein1957 @Malalatiana @beta3 @fffiloni @munish0838 @Reza2kn @bbunzeck @Creazycreator @andrei-saceleanu @jafhaponiuk @rca-etl @kf120 @burtenshaw @mmhamdy @grib0ed0v @Doopus @AnyaDes @ttkap @Xceron @Lewox @davanstrien @Azazelle @adirik @Ashish08 @AntonVic @kenantang @sdiazlor @g-ronimo @dennis-rall @prithivMLmods @girtss3 @flozi00 @WaveCut @Taylor658 @Wildminder @Sara9999 @phaelishall @sararob @dvilasuero @pgabrys @plaguss @CDS899 @timajwilliams @rudzinskimaciej @pavel-ai @aggr8 @ignacioct @MouseAI @Leeps @MaksKul @NicolasDmln @Muinez @kusht55 @caiolang @Jakub-Brand24 @loamy @Demijan @eliab96 @Viewegger @JosephCatrambone @p1atdev @mrshu @o639 @Targezed @Aviv-anthonnyolime @thliang01 @Ahmed-Amine @glards @pranaykoppula @nataliaElv @MaPirlet @alvarobartt @gabrielmbmb @zlicastro @Jaydip @Chouettecheveche @lilcheaty @ruyrdiaz @robintema @fdaudens @ggcristian @a-r-r-o-w @pates @joheras @stopsatgreen @bezo97 @chachi902 @iamyann @liamcripwell @dmb23 @korbih @anonymous7743 @akbdx18 @OVAWARE @severo @akontra @lichorosario @lhoestq @SebastianBodza @Vishnou @ameerazam08 @appoose @Mukei @mearco @joaquincabezas @Fizzarolli @thomastraum @igortopolski @OxxoCodes @patrickfleith @asoria @bn22 @sitammeur @Krodolf @bergr7f @Sbxxn @wietsevenema @sugatoray @Iamladi @MikeTrizna @feveromo @mokady @Bolero @prath @Dowwie @kfahn @decodingchris @alili2050 @RahulRaman @yzimmermann @Ameeeee @ecyht2 @MattMC001 @hemanthkumarak @Thegorgibus @akos2 @LawRun @ramithuh @SuperMuel @sjans @peterizsak @mosama @Eyel @mtr3 @cfahlgren1 @legentil @clem @Citaman @Aurelien-Morgan @AntoineBourgois @TotoB12 @Stanmey @osanseviero @multimodalart @maxiw @ariG23498 @ngk89 @femboysLover @dvs @tacohiddink @blanchon @DavidJimenez
reacted to
rwightman's
post
about 1 year ago
Post
1405
I'm currently on a push to expand the scope of image-based datasets on the Hub. There's certainly a lot already, but for anyone who's looked closely, there's not a whole lot of standardization. I aim to fix that: datasets under the timm and pixparse orgs will serve as canonical examples for various task/modality combinations and be usable without fuss in libraries like timm, OpenCLIP, and hopefully more.
I just uploaded the first multi-label dataset that I'll support with timm scripts soon: timm/plant-pathology-2021
Next up, object detection & segmentation! I've got an annotation spec sorted out, a lot of datasets ready to rip, and yeah, that means timm support for object detection, and eventually segmentation, is finally under development :O
reacted to
merve's
post with 🔥
about 1 year ago
Post
5503
OmniVision-968M: a new local VLM for edge devices, fast & small but performant
🚨 a new vision-language model with 9x fewer image tokens, super efficient
aligned with DPO to reduce hallucinations
⚡️ Apache 2.0 license 🔥
Demo hf.co/spaces/NexaAIDev/omnivlm-dpo-demo
Model https://huggingface.co/NexaAIDev/omnivision-968M
reacted to
jsulz's
post
about 1 year ago
Post
2178
In August, the XetHub team joined Hugging Face
- https://huggingface.co/blog/xethub-joins-hf - and we've been rolling up our sleeves to bring the best of both worlds together. We started with a deep dive into the current state of files stored with Git LFS on the Hub.
Getting this information was no small feat. We had to:
* Analyze a complete database dump of all repositories and files stored in Git LFS across Hugging Face.
* Parse through metadata on file sizes and types to accurately map the storage breakdown across Spaces, Models, and Datasets.
You can read more about the findings (with some jaw-dropping stats + charts) here https://www.linkedin.com/feed/update/urn:li:activity:7244486280351285248
reacted to
asoria's
post
about 1 year ago
Post
2627
I wrote a tutorial on how to get started with the fine-tuning process using Hugging Face tools, providing an end-to-end workflow.
The tutorial covers creating a new dataset using the new SQL Console and fine-tuning a model with SFT, guided by the Notebook Creator App.
You can read the full article here:
https://huggingface.co/blog/asoria/easy-fine-tuning-with-hf
asoria/auto-notebook-creator
reacted to
clem's
post with ❤️
over 1 year ago
Post
3889
This isn't a goal of ours because we have plenty of money in the bank, but quite excited to see that
@huggingface
is profitable these days, with 220 team members and most of our platform being free (like model hosting) and open-source for the community!
Especially noteworthy at a time when most AI startups wouldn't survive a year or two without VC money. Yay!
replied to
their
post
over 1 year ago
