PubMind Input Filtering BERT Model

Model Description

This model is a binary text classifier fine-tuned from DistilBERT, developed as part of PubMind, a literature-based framework for extracting genetic variant information, disease associations, pathogenicity classifications, functional annotations, and supporting evidence from biomedical literature.

PubMind is designed to process large-scale biomedical literature, including PubMed abstracts and PMC full texts, and identify text passages that contain useful genetic variant information. Because the full biomedical literature corpus is extremely large, this model was developed as an input filtering / triage model to reduce irrelevant text before downstream large language model extraction.

The model classifies each input paragraph or abstract as:

  • Positive: contains genetic variant-related information useful for PubMind extraction.
  • Negative: does not contain useful genetic variant information.

This filtering step allows PubMind to focus downstream LLM-based extraction on literature passages that are more likely to contain variant, disease, phenotype, functional, or pathogenicity information.

Intended Use

This model is intended for biomedical literature triage before variant extraction. Example use cases include:

  • Filtering PubMed abstracts for likely variant-containing content.
  • Filtering PMC full-text paragraphs before LLM-based extraction.
  • Reducing the input size for large-scale literature mining pipelines.
  • Prioritizing biomedical text passages for downstream variant–disease–pathogenicity extraction.
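The triage step described above can be sketched as a simple filter over passages. In this sketch, `score_passage` is a stand-in for any scorer that returns this model's positive-class probability, and the threshold value is illustrative:

```python
from typing import Callable, List

def triage_passages(
    passages: List[str],
    score_passage: Callable[[str], float],
    threshold: float = 0.5,
) -> List[str]:
    """Keep only passages whose positive-class score meets the threshold.

    `score_passage` stands in for a classifier returning the probability
    that a passage contains useful variant information.
    """
    return [p for p in passages if score_passage(p) >= threshold]

# Toy scorer for illustration only: flags passages containing an
# HGVS-style "c." variant mention.
toy_scorer = lambda text: 0.9 if "c." in text else 0.1

passages = [
    "The c.1012C>T variant in AP4M1 was reported in affected patients.",
    "We review the history of sequencing technologies.",
]
kept = triage_passages(passages, toy_scorer)
print(kept)  # only the variant-containing passage survives
```

Only the passages that pass this filter would then be sent to the downstream LLM extraction step.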

Not Intended For

This model is not intended to directly perform clinical variant interpretation or make diagnostic decisions. It only predicts whether an input text passage is likely to contain useful variant-related information.

It should not be used as a standalone clinical decision-support system.

Model Architecture

  • Base model: distilbert/distilbert-base-uncased
  • Task: binary sequence classification
  • Input: biomedical paragraph or abstract text
  • Output: probability or label indicating whether the input contains useful variant information
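Because DistilBERT accepts at most 512 tokens, long full-text paragraphs must be truncated or split before classification. A minimal sketch of overlapping word-window chunking (window sizes are illustrative; tokenizer-level splitting would be more precise):

```python
from typing import List

def chunk_words(text: str, max_words: int = 300, overlap: int = 50) -> List[str]:
    """Split text into overlapping word windows so that each chunk is
    likely to fit within the model's 512-token limit (approximate)."""
    words = text.split()
    if len(words) <= max_words:
        return [" ".join(words)]
    chunks = []
    step = max_words - overlap
    for start in range(0, len(words), step):
        chunks.append(" ".join(words[start:start + max_words]))
        if start + max_words >= len(words):
            break
    return chunks

print(len(chunk_words("word " * 1000)))  # 1000 words -> 4 overlapping chunks
```

Each chunk can then be classified independently, and a passage kept if any of its chunks scores positive.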

Training Data

The model was fine-tuned using labeled biomedical text examples constructed for PubMind input filtering.

The primary training dataset contained approximately 1.5K labeled abstracts/passages, including:

  • Approximately 1,000 negative examples generated using regex-based rules.
  • Approximately 500 positive examples from the PubTator3.0 training corpus.

Additional BERT-family models and larger datasets were evaluated during PubMind development, including GoogleBERT and BioMedBERT. Larger fine-tuning datasets of approximately 15K abstracts were also generated and benchmarked using two labeling strategies:

  1. Regex-label strategy
    Labels were generated using rule-based detection of variant-like text patterns.

  2. LLM-label strategy
    A large set of PubMed abstracts was first processed by an LLM. Abstracts that produced useful variant outputs were labeled as positive, while abstracts without useful variant outputs were labeled as negative.
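The actual rule set is not published here, but a regex of the kind used for variant-like pattern detection might look like the following. The pattern below is illustrative only (it covers simple HGVS-style cDNA and protein mentions), not the actual PubMind labeling rule:

```python
import re

# Illustrative pattern for HGVS-style variant mentions such as
# "c.1012C>T" or "p.Arg338Ter"; NOT the actual PubMind labeling rule.
VARIANT_PATTERN = re.compile(
    r"\b(?:c\.\d+[ACGT]>[ACGT]|p\.[A-Z][a-z]{2}\d+(?:[A-Z][a-z]{2}|Ter|\*))"
)

def regex_label(text: str) -> int:
    """Return 1 (positive) if a variant-like pattern is found, else 0."""
    return 1 if VARIANT_PATTERN.search(text) else 0

print(regex_label("The c.1012C>T variant in AP4M1 was reported."))   # 1
print(regex_label("We sequenced the cohort using short reads."))     # 0
```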

The released model corresponds to the PubMind input filtering model used to identify paragraphs or abstracts likely to contain variant information before downstream extraction.

How to Use

from transformers import AutoTokenizer, AutoModelForSequenceClassification
import torch

model_name = "Wangwpi/PubMind_finetuned_BERT"

# Load the fine-tuned tokenizer and classification model
tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForSequenceClassification.from_pretrained(model_name)

text = "The c.1012C>T variant in AP4M1 was reported in patients with neurological phenotypes."

# Tokenize; inputs longer than 512 tokens are truncated
inputs = tokenizer(
    text,
    return_tensors="pt",
    truncation=True,
    padding=True,
    max_length=512
)

# Run inference without tracking gradients
with torch.no_grad():
    outputs = model(**inputs)
    probs = torch.softmax(outputs.logits, dim=-1)

# The predicted index maps to a label via model.config.id2label
predicted_class = torch.argmax(probs, dim=-1).item()
print("Predicted class:", predicted_class)
print("Probabilities:", probs)
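For recall-oriented triage, thresholding the positive-class probability may be preferable to taking the argmax; which index corresponds to the positive class should be confirmed against `model.config.id2label`. A self-contained sketch of the thresholding logic on example logits (the logits and the index-1 assumption are illustrative):

```python
import math

def softmax(logits):
    """Convert raw logits to probabilities (numerically stable)."""
    m = max(logits)
    exps = [math.exp(x - m) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]

# Example logits for one passage.
logits = [-1.2, 2.3]
probs = softmax(logits)
positive_prob = probs[1]  # assumes index 1 is the positive class

# A lower threshold keeps more borderline passages (higher recall,
# higher downstream LLM cost); a higher threshold is more selective.
keep = positive_prob >= 0.3
print(round(positive_prob, 3), keep)
```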

Citation

If you use this model, please cite both PubMind and the original DistilBERT model:

@article{wang2025pubmind,
  title = {PubMind: Literature-Based Genetic Variant Extraction and Functional Annotation Using Large Language Models},
  author = {Wang, Peng and Wang, Kai},
  journal = {bioRxiv},
  year = {2025},
  doi = {10.1101/2025.10.13.682183}
}

@article{sanh2019distilbert,
  title = {DistilBERT, a distilled version of BERT: smaller, faster, cheaper and lighter},
  author = {Sanh, Victor and Debut, Lysandre and Chaumond, Julien and Wolf, Thomas},
  journal = {arXiv},
  year = {2019},
  volume = {abs/1910.01108}
}

Contact

For questions about PubMind, please refer to the GitHub repository:

https://github.com/WGLab/PubMind
