🎮 Live Model Demo: Upload an Android screenshot and instructions to see the model in action! Tonic/l-operator-demo
Built in a garage, funded by pre-orders, no VC. Now we're scaling to 1,000 installer units.
We're giving 50 limited-edition prototypes to investors, installers & researchers who want to co-design the sovereign smart home.
👇 Drop “EUSKERA” in the comments if you want an invite, tag a friend who still thinks Alexa is “convenient,” and smash ♥️ if AI should belong to people, not servers.
Just wanted to announce 🏭SmolFactory: it's the quickest and best way to fine-tune SmolLM3 and GPT-OSS-20B on Hugging Face!
Basically, it's an app you can run on Hugging Face by duplicating the Space and running your training directly on Hugging Face GPUs.
It helps you select a dataset and model, fine-tune the model, track the run from your mobile phone with an experiment tracker, and push your model card, and it even builds a demo for you on Hugging Face automatically, so you can test the model directly when training is done!
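If you'd rather script the setup than click through the UI, here's a minimal sketch using huggingface_hub. Note the Space id and hardware tier below are placeholders I'm assuming for illustration, not official values:

```python
from huggingface_hub import duplicate_space

# Copy the SmolFactory Space into your own account and attach a GPU.
# "<owner>/SmolFactory" is a placeholder — use the actual Space id from the post.
my_copy = duplicate_space("<owner>/SmolFactory", hardware="a10g-small")
print(my_copy)  # URL of your duplicated Space, where you launch the training run
```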
Just submitted my plugin idea to the G-Assist Plugin Hackathon by @nvidia. Check it out, it's a great way to use a local SLM (small language model) on a Windows machine to easily and locally get things done! https://github.com/NVIDIA/G-Assist
So at every bio/med/chem meeting I go to, I always ask the same questions: "Why are you sharing a gdrive link with me for this?" and "Do you have any plans to publish your model weights and datasets on huggingface?" And today I finally got a good answer that explains everything:
Basically, there is some kind of government censorship on this (USA, but I'm sure others too): researchers are told they are not allowed to share, because it is considered a "data leak," which is illegal!
This is terrible! But the good news is that we can do something about it!
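Concretely, "doing something about it" can be as simple as a few lines of huggingface_hub once you have the green light. A minimal sketch (the repo names and folder paths are placeholders):

```python
from huggingface_hub import HfApi

api = HfApi()  # picks up your token from `huggingface-cli login` or HF_TOKEN

# Create the repos first (private, so you can review before making them public).
api.create_repo("your-lab/your-model", repo_type="model", private=True, exist_ok=True)
api.create_repo("your-lab/your-dataset", repo_type="dataset", private=True, exist_ok=True)

# Upload the local folders containing the weights and the data files.
api.upload_folder(folder_path="./weights", repo_id="your-lab/your-model", repo_type="model")
api.upload_folder(folder_path="./data", repo_id="your-lab/your-dataset", repo_type="dataset")
```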
What inspired the Transformer architecture in the "Attention Is All You Need" paper? And how were various ideas combined to create this groundbreaking model?
In this lengthy article, I explore the story and the origins of some of the ideas introduced in the paper. We'll cover everything from the fundamental attention mechanism that lies at its heart to the surprisingly simple explanation for its name, Transformer.
💡 Examples of ideas explored in the article:
✅ What was the inspiration for the attention mechanism?
✅ How did we go from attention to self-attention?
✅ Did the team have any other names in mind for the model?
and more...
I aim to tell the story of Transformers as I would have wanted to read it, and hopefully, one that appeals to others interested in the details of this fascinating idea. This narrative draws from video interviews, lectures, articles, tweets/Xs, and some digging into the literature. I have done my best to be accurate, but errors are possible. If you find inaccuracies or have any additions, please do reach out, and I will gladly make the necessary updates.
Detect hallucinations in answers based on context and questions using ModernBERT with 8192-token context support!
### Model Details
- **Model Name**: [lettucedect-large-modernbert-en-v1](KRLabsOrg/lettucedect-large-modernbert-en-v1)
- **Organization**: [KRLabsOrg](KRLabsOrg)
- **GitHub**: [https://github.com/KRLabsOrg/LettuceDetect](https://github.com/KRLabsOrg/LettuceDetect)
- **Architecture**: ModernBERT (Large) with extended context support up to 8192 tokens
- **Task**: Token Classification / Hallucination Detection
- **Training Dataset**: [RagTruth](wandb/RAGTruth-processed)
- **Language**: English
- **Capabilities**: Detects hallucinated spans in answers, provides confidence scores, and calculates average confidence across detected spans.
LettuceDetect excels at processing long documents to determine if an answer aligns with the provided context, making it a powerful tool for ensuring factual accuracy.
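If you just want to try the checkpoint quickly, here's a minimal sketch that loads it as a plain token-classification model with 🤗 transformers. A few assumptions: how context, question, and answer get packed into one input is handled by the LettuceDetect library itself, so the simple concatenation and the label interpretation below are illustrative guesses — check the GitHub repo for the supported inference API.

```python
import torch
from transformers import AutoTokenizer, AutoModelForTokenClassification

# Needs a transformers version with ModernBERT support (>= 4.48).
model_id = "KRLabsOrg/lettucedect-large-modernbert-en-v1"
tokenizer = AutoTokenizer.from_pretrained(model_id)
model = AutoModelForTokenClassification.from_pretrained(model_id)

# Toy RAG triple: the answer's population figure contradicts the context.
context = "France is a country in Europe. The capital of France is Paris. The population of France is 67 million."
question = "What is the capital of France, and what is its population?"
answer = "The capital of France is Paris. The population of France is 95 million."

# ASSUMPTION: plain concatenation; the official library formats the input for you.
inputs = tokenizer(f"{context}\n{question}\n{answer}", return_tensors="pt",
                   truncation=True, max_length=8192)

with torch.no_grad():
    logits = model(**inputs).logits  # shape: (1, seq_len, num_labels)

labels = logits.argmax(dim=-1)[0].tolist()
tokens = tokenizer.convert_ids_to_tokens(inputs["input_ids"][0])
# ASSUMPTION: label 0 = supported, non-zero labels mark hallucinated span tokens.
flagged = [(tok, model.config.id2label[lab]) for tok, lab in zip(tokens, labels) if lab != 0]
print(flagged)
```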
🎉 We're excited to introduce MemoryCode, a novel synthetic dataset designed to rigorously evaluate LLMs' ability to track and execute coding instructions across multiple sessions. MemoryCode simulates realistic workplace scenarios where a mentee (the LLM) receives coding instructions from a mentor amidst a stream of both relevant and irrelevant information.
💡 But what makes MemoryCode unique?! The combination of the following:
✅ Multi-Session Dialogue Histories: MemoryCode consists of chronological sequences of dialogues between a mentor and a mentee, mirroring real-world interactions between coworkers.
✅ Interspersed Irrelevant Information: Critical instructions are deliberately interspersed with unrelated content, replicating the information overload common in office environments.
✅ Instruction Updates: Coding rules and conventions can be updated multiple times throughout the dialogue history, requiring LLMs to track and apply the most recent information (see the toy example after this list).
✅ Prospective Memory: Unlike previous datasets that cue information retrieval, MemoryCode requires LLMs to spontaneously recall and apply relevant instructions without explicit prompts.
✅ Practical Task Execution: LLMs are evaluated on their ability to use the retrieved information to perform practical coding tasks, bridging the gap between information recall and real-world application.
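Here's a purely hypothetical toy illustration (not an actual sample from MemoryCode) of the behaviour being tested: an instruction is given early, updated later, buried in filler, and the model must apply the latest version on its own.

```python
# Hypothetical mentor/mentee dialogue history (illustration only, not dataset content):
# Session 3  (mentor): "From now on, start every function name with the prefix 'x_'."
# Session 9  (mentor): "By the way, lunch is at noon on Friday."        <- irrelevant filler
# Session 14 (mentor): "Update: use the prefix 'fn_' instead of 'x_'."
# Session 20 (task)  : "Write a function that sums a list of numbers."

# A correct mentee spontaneously recalls and applies the *latest* rule (prospective memory):
def fn_sum_numbers(numbers: list[float]) -> float:
    """Sum a list of numbers, following the most recent naming convention ('fn_' prefix)."""
    return sum(numbers)

print(fn_sum_numbers([1, 2, 3.5]))  # 6.5
```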
📌 Our Findings
1️⃣ While even small models can handle isolated coding instructions, the performance of top-tier models like GPT-4o dramatically deteriorates when instructions are spread across multiple sessions.
2️⃣ This performance drop isn't simply due to the length of the context. Our analysis indicates that LLMs struggle to reason compositionally over sequences of instructions and updates. They have difficulty keeping track of which instructions are current and how to apply them.
⛓ Evaluating Long Context #2: SCROLLS and ZeroSCROLLS
In this series of posts tracing the history of long context evaluation, we started with Long Range Arena (LRA). Introduced in 2020, LRA is one of the earliest benchmarks designed to tackle the challenge of long context evaluation. However, it wasn't introduced to evaluate LLMs, but rather the transformer architecture in general.
📜 The SCROLLS benchmark, introduced in 2022, addresses this gap in NLP/LLM research. SCROLLS challenges models with tasks that require reasoning over extended sequences (according to 2022 standards). So, what does it offer?
1️⃣ Long Text Focus: SCROLLS (unlike LRA) focuses mainly on text and contains inputs with thousands of words, testing models' ability to synthesize information across lengthy documents.
2️⃣ Diverse Tasks: Includes summarization, question answering, and natural language inference across domains like literature, science, and business.
3️⃣ Unified Format: All datasets are available in a text-to-text format, facilitating easy evaluation and comparison of models.
Building on SCROLLS, ZeroSCROLLS takes long text evaluation to the next level by focusing on zero-shot learning. Other features include:
1️⃣ New Tasks: Introduces tasks like sentiment aggregation and sorting book chapter summaries.
2️⃣ Leaderboard: A live leaderboard encourages continuous improvement and competition among researchers.
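If you want to poke at these benchmarks yourself, here's a minimal loading sketch with 🤗 datasets. The Hub ids, config names, and field names are my assumptions from the dataset cards (SCROLLS under tau/scrolls, ZeroSCROLLS under tau/zero_scrolls), so double-check them:

```python
from datasets import load_dataset

# Load one SCROLLS task (GovReport summarization). SCROLLS and ZeroSCROLLS use
# dataset loading scripts, so recent `datasets` versions need trust_remote_code=True.
gov_report = load_dataset("tau/scrolls", "gov_report", split="validation",
                          trust_remote_code=True)

example = gov_report[0]
print(example["input"][:500])  # the long source document
print(example["output"])       # the reference summary

# ZeroSCROLLS (the zero-shot variants) is assumed to live under "tau/zero_scrolls".
```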
💡 What are some other landmark benchmarks in the history of long context evaluation? Feel free to share your thoughts and suggestions in the comments.