
clinical-ner-deid-biobert-v2

Model Overview

This is a Named Entity Recognition (NER) model fine-tuned from BioBERT for Clinical De-identification (De-ID). It identifies and classifies Protected Health Information (PHI), such as patient names, dates, ages, and medical record IDs, in unstructured clinical notes. Detecting these spans is a prerequisite for complying with regulations like HIPAA.

Model Architecture

  • Base Model: BioBERT-v1.1 (based on BERT-large, pre-trained on biomedical corpora like PubMed abstracts and PMC full-text articles).
  • Modification: The base model is adapted for Token Classification (BertForTokenClassification). A linear classification head is placed on top of the last hidden state of every input token.
  • NER Scheme: Uses the standard IOB2 tagging scheme to distinguish between the beginning (B-), inside (I-), and outside (O) of an entity.
  • Target Entities (Labels):
    • NAME (Patient/Doctor Names)
    • AGE (Specific Age/Year)
    • DATE (Admission, Discharge, Test Dates)
    • ID (Medical Record Numbers, Social Security Numbers)
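The IOB2 scheme combined with the four entity types above implies a nine-label tag set: one O label plus a B- and an I- label per type. The following is a minimal sketch of how such a tag set can be built and how tagged tokens can be grouped back into entities. The label order and the helper name iob2_to_spans are assumptions for illustration; the model's real id2label mapping is stored in its config and may differ.

```python
ENTITY_TYPES = ["NAME", "AGE", "DATE", "ID"]

# Assumed label order; the real mapping lives in the model config (id2label).
LABELS = ["O"] + [f"{prefix}-{etype}" for etype in ENTITY_TYPES for prefix in ("B", "I")]
ID2LABEL = dict(enumerate(LABELS))
LABEL2ID = {label: idx for idx, label in ID2LABEL.items()}


def iob2_to_spans(tokens, tags):
    """Group IOB2-tagged tokens into (entity_type, text) spans."""
    spans = []
    current_type, current_tokens = None, []
    for token, tag in zip(tokens, tags):
        if tag.startswith("B-"):
            # A B- tag always starts a new entity, closing any open one.
            if current_type is not None:
                spans.append((current_type, " ".join(current_tokens)))
            current_type, current_tokens = tag[2:], [token]
        elif tag.startswith("I-") and current_type == tag[2:]:
            # An I- tag continues the open entity of the same type.
            current_tokens.append(token)
        else:
            # "O", or an I- tag that does not continue the open entity, closes it.
            if current_type is not None:
                spans.append((current_type, " ".join(current_tokens)))
            current_type, current_tokens = None, []
    if current_type is not None:
        spans.append((current_type, " ".join(current_tokens)))
    return spans


tokens = ["Patient", "Alex", "Johnson", ",", "aged", "65"]
tags = ["O", "B-NAME", "I-NAME", "O", "O", "B-AGE"]
print(iob2_to_spans(tokens, tags))
# [('NAME', 'Alex Johnson'), ('AGE', '65')]
```

The B- prefix is what lets two adjacent entities of the same type be separated: a second B-NAME directly after an I-NAME starts a new span instead of extending the previous one.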

Intended Use

  • PHI Redaction: Automatically flagging and redacting sensitive data in clinical trial reports or public health datasets.
  • Data Anonymization: Preparing patient notes for secondary use, such as machine learning training or academic research.
  • Compliance: Supporting regulatory adherence (e.g., HIPAA in the US, GDPR in the EU) when handling clinical documentation.

Limitations and Ethical Considerations

  • False Positives/Negatives: The model can miss PHI (false negatives) or flag text that is not PHI (false positives). Manual review by a human expert is mandatory for production use, especially for regulatory compliance.
  • Contextual Ambiguity: It may mislabel tokens that are valid in more than one category (e.g., "May" can be a patient's name or a month).
  • PHI Definition Drift: Regulations and the definition of PHI can change. The model must be periodically re-evaluated against new standards.
  • Language: This model was trained and tested only on English clinical notes.
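One way to organize the mandatory manual review is to route low-confidence predictions to a human first. The sketch below assumes entity dictionaries carrying the "score" field that the transformers token-classification pipeline returns. The threshold value, the function name split_for_review, and the example scores are hypothetical and chosen only for illustration.

```python
REVIEW_THRESHOLD = 0.90  # hypothetical cut-off; tune it on held-out annotated notes


def split_for_review(entities, threshold=REVIEW_THRESHOLD):
    """Separate predictions into confident ones and ones needing human review."""
    confident = [e for e in entities if e["score"] >= threshold]
    needs_review = [e for e in entities if e["score"] < threshold]
    return confident, needs_review


# Made-up predictions, not real model output.
predictions = [
    {"entity_group": "NAME", "word": "May", "score": 0.61},
    {"entity_group": "DATE", "word": "2024-11-15", "score": 0.99},
]

confident, needs_review = split_for_review(predictions)
print("Review first:", [e["word"] for e in needs_review])
```

A score threshold only prioritizes what the model did detect. PHI the model missed entirely has no score, so a reviewer still needs to read the full note.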

Example Code

To use the model for de-identification:

from transformers import AutoTokenizer, AutoModelForTokenClassification
from transformers import pipeline

# Load model and tokenizer
model_name = "YourOrg/clinical-ner-deid-biobert-v2"
tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForTokenClassification.from_pretrained(model_name)

# Create the NER pipeline
ner_pipeline = pipeline(
    "ner",
    model=model,
    tokenizer=tokenizer,
    aggregation_strategy="simple"
)

clinical_note = "Patient Alex Johnson, aged 65, was admitted on 2024-11-15 with MRN 98765432."

results = ner_pipeline(clinical_note)

print("--- Detected PHI Entities ---")
for result in results:
    entity_type = result['entity_group']
    word = result['word']
    print(f"Type: {entity_type:<5} | Value: {word}")

# Each result also carries 'start' and 'end' character offsets,
# which can be used to redact or replace the detected PHI span.
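The redaction step can be sketched as follows, using the "start" and "end" character offsets that the transformers pipeline returns with each aggregated entity. The helper names redact and mock_entity, the placeholder format, and the hand-built entity list are assumptions for illustration. The list stands in for real pipeline output so that the example runs without downloading the model.

```python
def redact(text, entities, placeholder="[{label}]"):
    """Replace each detected span with a placeholder such as [NAME]."""
    redacted = text
    # Work right-to-left so earlier character offsets stay valid after each replacement.
    for ent in sorted(entities, key=lambda e: e["start"], reverse=True):
        tag = placeholder.format(label=ent["entity_group"])
        redacted = redacted[:ent["start"]] + tag + redacted[ent["end"]:]
    return redacted


def mock_entity(text, label, value):
    """Build a pipeline-style entity dict for the first occurrence of value."""
    start = text.index(value)
    return {"entity_group": label, "start": start, "end": start + len(value)}


note = "Patient Alex Johnson, aged 65, was admitted on 2024-11-15 with MRN 98765432."

# Hand-built stand-in for ner_pipeline(note); real results also include "score" and "word".
mock_results = [
    mock_entity(note, "NAME", "Alex Johnson"),
    mock_entity(note, "AGE", "65"),
    mock_entity(note, "DATE", "2024-11-15"),
    mock_entity(note, "ID", "98765432"),
]

print(redact(note, mock_results))
# Patient [NAME], aged [AGE], was admitted on [DATE] with MRN [ID].
```

Slicing by character offsets avoids a pitfall of replacing on the "word" field: the pipeline reconstructs that string from word pieces, so it may not match the original text exactly.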