πŸ›‘οΈ Hybrid Chat + Email Spam Classifier (Encoder-Only, Multi-Turn)

This repository provides a lightweight, encoder-only multi-turn classifier designed to detect spam and unwanted content across emails and chat conversations.

It supports short and long messages, as well as multi-turn conversational inputs (metadata + message).

It is a fast, encoder-only model distilled from a 14B-parameter teacher model on a dataset of roughly 20M records spanning emails, support chats, and messaging threads.


✨ Features

  • Encoder-only architecture β†’ returns per-label scores in a single forward pass
  • Multi-turn support β†’ handles conversation history and context windows
  • Hybrid input domain β†’ optimized for both chat messages & email bodies
  • High-throughput β†’ suitable for millions of messages/day
  • Ideal for security filters (spam, scams, phishing, self-promotional content)
  • Open-source and deployable anywhere (CPU or GPU)

πŸ”§ Model Architecture

  • Type: Encoder-only (XLM-RoBERTa Large)

  • Input format:

[CONTEXT 1] [CONTEXT 2] ... [USER MESSAGE]
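
For illustration, the sketch below flattens a short conversation into that format. The plain space-joined concatenation is an assumption based on the pattern above, not a documented separator scheme, and long histories should be truncated to fit the encoder's context window.

```python
# Sketch: flatten context turns plus the latest user message into one string,
# following the "[CONTEXT 1] [CONTEXT 2] ... [USER MESSAGE]" pattern above.
# The space-joined concatenation is an assumption, not a documented format.
def build_input(context_turns: list[str], user_message: str) -> str:
    return " ".join(list(context_turns) + [user_message])

text = build_input(
    ["Hi, is this item still available?", "Yes, it ships tomorrow."],
    "Click here to claim your free prize: http://spam.example/win",
)
print(text)
```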

Labels include:

  • spam
  • regular (ham)
  • marketing
  • gibberish
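
A minimal inference sketch with the πŸ€— Transformers text-classification pipeline is shown below. The model ID matches this repository; the returned label strings depend on the checkpoint's id2label mapping and are assumed here to be the four labels listed above.

```python
from transformers import pipeline

# Sketch: score one flattened conversation and get scores for all four labels.
# top_k=None returns every label; truncation=True guards against inputs longer
# than the encoder's maximum sequence length.
classifier = pipeline(
    "text-classification",
    model="baptistejamin/xlm-roberta-large-spam_v4",
    top_k=None,
    truncation=True,
)

result = classifier(
    "Hi, is this item still available? Click here to claim your free prize!"
)
print(result)
# e.g. [[{'label': 'spam', 'score': 0.97}, {'label': 'marketing', 'score': 0.02}, ...]]
```

For high-throughput use, pass a list of messages with a `batch_size` argument and set `device=0` to run on a GPU; batched inference is typically how large daily volumes are handled in practice.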

πŸ“Š Benchmark

  • F1 Spam: 0.90
  • F1 Regular: 0.95
  • F1 Marketing: 0.87
  • F1 Gibberish: 0.94

While this model is not perfect, it is excellent at quickly catching spam and substantially outperforms traditional Bayesian filters.
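
To reproduce comparable per-label F1 numbers on your own data, a scikit-learn evaluation sketch might look like the following. The CSV file name and its text/label columns are hypothetical, since the benchmark set behind the figures above is not published with this card.

```python
import csv

from sklearn.metrics import classification_report
from transformers import pipeline

# Sketch: per-label F1 on a user-provided, labeled message set.
# "my_eval_set.csv" and its "text"/"label" columns are hypothetical.
classifier = pipeline(
    "text-classification",
    model="baptistejamin/xlm-roberta-large-spam_v4",
    truncation=True,
)

texts, gold = [], []
with open("my_eval_set.csv", newline="", encoding="utf-8") as f:
    for row in csv.DictReader(f):
        texts.append(row["text"])
        gold.append(row["label"])

preds = [out["label"] for out in classifier(texts, batch_size=32)]
print(classification_report(gold, preds, digits=2))
```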

Model weights are distributed in Safetensors format: ~0.6B parameters, F32 tensors.