Hybrid Chat + Email Spam Classifier (Encoder-Only, Multi-Turn)
This repository provides a lightweight, encoder-only multi-turn classifier designed to detect spam and unwanted content across emails and chat conversations.
It supports short and long messages, as well as multi-turn conversational inputs (metadata + message).
It is a fast encoder-only model, distilled from a 14B teacher model on a mixed dataset of 20M records covering emails, support chats, and messaging threads.
Features
- Encoder-only architecture: outputs a score per label
- Multi-turn support: handles conversation history and context windows
- Hybrid input domain: optimized for both chat messages and email bodies
- High throughput: suitable for millions of messages/day
- Ideal for security filters (spam, scams, phishing, self-promotion content)
- Open-source and deployable anywhere (CPU or GPU)
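For high-throughput use, messages are usually scored in batches rather than one at a time. The helper below is a minimal sketch of that idea; the name `batched` and the default batch size of 32 are illustrative assumptions, not settings documented for this model.

```python
from typing import Iterator, List, Sequence


def batched(messages: Sequence[str], batch_size: int = 32) -> Iterator[List[str]]:
    """Yield consecutive slices of `messages`, each with at most `batch_size` items.

    Hypothetical helper: the batch size is an assumption and should be tuned
    to the available CPU or GPU memory.
    """
    if batch_size < 1:
        raise ValueError("batch_size must be at least 1")
    for start in range(0, len(messages), batch_size):
        yield list(messages[start:start + batch_size])
```

Each yielded batch can then be passed to the classifier in a single forward pass.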
Model Architecture
Type: Encoder-only (XLM-RoBERTa Large)
Input format:
[CONTEXT 1] [CONTEXT 2] ... [USER MESSAGE]
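The input format above can be sketched as a small helper that places earlier turns before the message to classify. This is an assumption-heavy illustration: the function name `build_input`, the single-space separator, and the `max_context` cap are hypothetical, and the exact separator used during training is not stated here.

```python
from typing import List, Optional


def build_input(context: List[str], user_message: str,
                max_context: Optional[int] = None, sep: str = " ") -> str:
    """Concatenate context turns and the user message into one string.

    Assumptions (not confirmed by the model card): turns are joined with
    `sep`, and only the most recent `max_context` turns are kept.
    """
    if max_context is not None:
        # Keep the newest turns; an empty list when max_context is 0.
        context = context[-max_context:] if max_context > 0 else []
    parts = [turn.strip() for turn in context if turn.strip()]
    parts.append(user_message.strip())
    return sep.join(parts)
```

For example, `build_input(["Hi, how can I help?", "I have a billing question"], "Click here to win a prize")` returns the three turns as one string with the message to classify last.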
Labels include:
- spam
- regular (ham)
- marketing
- gibberish
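A minimal inference sketch using the Hugging Face `transformers` text-classification pipeline is shown below. The helper names `top_label` and `classify` are hypothetical, and the label strings actually returned depend on the model's configuration, so they may differ from the names listed above.

```python
from typing import Dict, List, Tuple

MODEL_ID = "baptistejamin/xlm-roberta-large-spam_v4"


def top_label(scores: List[Dict]) -> Tuple[str, float]:
    """Return the (label, score) pair with the highest score."""
    best = max(scores, key=lambda item: item["score"])
    return best["label"], best["score"]


def classify(texts: List[str], device: int = -1) -> List[Tuple[str, float]]:
    """Score each text and keep the best label. device=-1 runs on CPU."""
    # Imported lazily so the helper above works without transformers installed.
    from transformers import pipeline

    # top_k=None asks the pipeline for the scores of all labels.
    clf = pipeline("text-classification", model=MODEL_ID,
                   device=device, top_k=None)
    return [top_label(scores) for scores in clf(texts, truncation=True)]
```

Calling `classify(["Win a free phone now!!!"])` downloads the model on first use and returns one `(label, score)` pair per input text.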
Benchmark
- F1 Spam: 0.90
- F1 Regular: 0.95
- F1 Marketing: 0.87
- F1 Gibberish: 0.94
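As a reading aid, the sketch below shows how a per-class F1 score is computed from counts and how the four reported values combine into an unweighted (macro) average. The evaluation set and the averaging method are not stated here, so the macro average is only an illustrative summary, and the function name `f1_score` is hypothetical.

```python
def f1_score(tp: int, fp: int, fn: int) -> float:
    """F1 = 2*TP / (2*TP + FP + FN), the harmonic mean of precision and recall."""
    denominator = 2 * tp + fp + fn
    return 2 * tp / denominator if denominator else 0.0


# Per-class F1 values as reported in the benchmark list above.
reported_f1 = {"spam": 0.90, "regular": 0.95, "marketing": 0.87, "gibberish": 0.94}

# Unweighted mean over the four classes (an assumption about how to summarize).
macro_f1 = sum(reported_f1.values()) / len(reported_f1)
print(f"macro F1: {macro_f1:.3f}")
```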
While this model is not perfect, it catches spam quickly and performs much better than Bayesian filters.
Model tree for baptistejamin/xlm-roberta-large-spam_v4
Base model: FacebookAI/xlm-roberta-large