# PubMind Input Filtering BERT Model

## Model Description
This model is a fine-tuned DistilBERT-based binary text classification model developed as part of PubMind, a literature-based framework for extracting genetic variant information, disease associations, pathogenicity classifications, functional annotations, and supporting evidence from biomedical literature.
PubMind is designed to process large-scale biomedical literature, including PubMed abstracts and PMC full texts, and identify text passages that contain useful genetic variant information. Because the full biomedical literature corpus is extremely large, this model was developed as an input filtering / triage model to reduce irrelevant text before downstream large language model extraction.
The model classifies each input paragraph or abstract as:
- Positive: contains genetic variant-related information useful for PubMind extraction.
- Negative: does not contain useful genetic variant information.
This filtering step allows PubMind to focus downstream LLM-based extraction on literature passages that are more likely to contain variant, disease, phenotype, functional, or pathogenicity information.
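To make the triage role concrete, the sketch below shows where the filter sits in a two-stage pipeline. Both functions are hypothetical placeholders, not part of PubMind's actual API; the real filter invocation is shown under "How to Use" below.

```python
# Conceptual two-stage flow: a cheap filter gates which passages reach the
# expensive LLM extraction step. Both functions are hypothetical placeholders.

def is_variant_relevant(text: str) -> bool:
    """Stand-in for the DistilBERT filter (see 'How to Use' below)."""
    return "variant" in text.lower()  # illustration only, not the real model

def extract_with_llm(text: str) -> dict:
    """Stand-in for downstream LLM-based variant extraction."""
    return {"passage": text}

passages = [
    "The c.1012C>T variant in AP4M1 was reported in patients.",
    "This editorial discusses publication ethics.",
]

extracted = [extract_with_llm(p) for p in passages if is_variant_relevant(p)]
print(len(extracted), "passage(s) forwarded to LLM extraction")
```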
## Intended Use
This model is intended for biomedical literature triage before variant extraction. Example use cases include:
- Filtering PubMed abstracts for likely variant-containing content.
- Filtering PMC full-text paragraphs before LLM-based extraction.
- Reducing the input size for large-scale literature mining pipelines.
- Prioritizing biomedical text passages for downstream variant–disease–pathogenicity extraction.
## Not Intended For
This model is not intended to directly perform clinical variant interpretation or make diagnostic decisions. It only predicts whether an input text passage is likely to contain useful variant-related information.
It should not be used as a standalone clinical decision-support system.
## Model Architecture

- Base model: distilbert/distilbert-base-uncased
- Task: binary sequence classification
- Input: biomedical paragraph or abstract text
- Output: probability or label indicating whether the input contains useful variant information
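As a quick sanity check before inference, the classification head can be inspected directly. The snippet below is a minimal sketch using the standard transformers config API; note that the positive/negative index assignment is not documented here, so `id2label` may only show generic names.

```python
from transformers import AutoConfig

# Inspect the classification head of the released checkpoint.
config = AutoConfig.from_pretrained("Wangwpi/PubMind_finetuned_BERT")

print(config.model_type)  # expected: "distilbert"
print(config.num_labels)  # expected: 2 (binary classification)
print(config.id2label)    # may be generic LABEL_0/LABEL_1; the mapping is undocumented
```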
## Training Data
The model was fine-tuned using labeled biomedical text examples constructed for PubMind input filtering.
The primary training dataset contained approximately 1,500 labeled abstracts/passages, including:
- Approximately 1,000 negative examples generated using regex-based rules.
- Approximately 500 positive examples from the PubTator3.0 training corpus.
Additional BERT-family models and larger datasets were evaluated during PubMind development, including GoogleBERT and BioMedBERT. Larger fine-tuning datasets of approximately 15K abstracts were also generated and benchmarked using two labeling strategies:
- **Regex-label strategy:** labels were generated using rule-based detection of variant-like text patterns.
- **LLM-label strategy:** a large set of PubMed abstracts was first processed by an LLM; abstracts that produced useful variant outputs were labeled positive, while abstracts without useful variant outputs were labeled negative.
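For illustration, a regex-based labeler in the spirit of the regex-label strategy might look like the sketch below. The specific patterns (HGVS-style cDNA and protein notation plus dbSNP rsIDs) are assumptions chosen for demonstration, not the exact rules used to build the PubMind training set.

```python
import re

# Hypothetical variant-like text patterns; the actual PubMind rules are not published here.
VARIANT_PATTERNS = [
    re.compile(r"\bc\.\d+[ACGT]>[ACGT]\b"),               # HGVS cDNA substitutions, e.g. c.1012C>T
    re.compile(r"\bp\.[A-Z][a-z]{2}\d+[A-Z][a-z]{2}\b"),  # HGVS protein changes, e.g. p.Arg338Cys
    re.compile(r"\brs\d{3,}\b"),                          # dbSNP identifiers, e.g. rs113993960
]

def regex_label(text: str) -> int:
    """Return 1 (positive) if any variant-like pattern matches, else 0 (negative)."""
    return int(any(p.search(text) for p in VARIANT_PATTERNS))

print(regex_label("The c.1012C>T variant in AP4M1 was reported in patients."))  # 1
print(regex_label("We review hospital readmission rates after surgery."))       # 0
```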
The released model corresponds to the PubMind input filtering model used to identify paragraphs or abstracts likely to contain variant information before downstream extraction.
## How to Use
```python
from transformers import AutoTokenizer, AutoModelForSequenceClassification
import torch

model_name = "Wangwpi/PubMind_finetuned_BERT"
tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForSequenceClassification.from_pretrained(model_name)
model.eval()  # inference mode

text = "The c.1012C>T variant in AP4M1 was reported in patients with neurological phenotypes."

# Tokenize with truncation so the input fits the 512-token limit.
inputs = tokenizer(
    text,
    return_tensors="pt",
    truncation=True,
    padding=True,
    max_length=512,
)

with torch.no_grad():
    outputs = model(**inputs)

probs = torch.softmax(outputs.logits, dim=-1)
predicted_class = torch.argmax(probs, dim=-1).item()

print("Predicted class:", predicted_class)
print("Probabilities:", probs)
```
## Citation
If you use this model, please cite both PubMind and the original DistilBERT model:
```bibtex
@article{wang2025pubmind,
  title   = {PubMind: Literature-Based Genetic Variant Extraction and Functional Annotation Using Large Language Models},
  author  = {Wang, Peng and Wang, Kai},
  journal = {bioRxiv},
  year    = {2025},
  doi     = {10.1101/2025.10.13.682183}
}

@article{Sanh2019DistilBERTAD,
  title   = {DistilBERT, a distilled version of BERT: smaller, faster, cheaper and lighter},
  author  = {Sanh, Victor and Debut, Lysandre and Chaumond, Julien and Wolf, Thomas},
  journal = {arXiv preprint arXiv:1910.01108},
  year    = {2019}
}
```
## Contact

For questions about PubMind, please refer to the PubMind GitHub repository.