---
language:
- te
license: apache-2.0
base_model: emilyalsentzer/Bio_ClinicalBERT
tags:
- token-classification
- ner
- pii
- pii-detection
- de-identification
- privacy
- healthcare
- medical
- clinical
- phi
- telugu
- pytorch
- transformers
- openmed
pipeline_tag: token-classification
library_name: transformers
metrics:
- f1
- precision
- recall
model-index:
- name: OpenMed-PII-Telugu-BioClinicalBERT-110M-v1
  results:
  - task:
      type: token-classification
      name: Named Entity Recognition
    dataset:
      name: AI4Privacy (Telugu subset)
      type: ai4privacy/pii-masking-400k
      split: test
    metrics:
    - type: f1
      value: 0.8683
      name: F1 (micro)
    - type: precision
      value: 0.8887
      name: Precision
    - type: recall
      value: 0.8489
      name: Recall
widget:
- text: "డా. రాజేష్ శర్మ (ఆధార్: 1234 5678 9012) ను rajesh.sharma@hospital.in లేదా +91 98765 43210 లో సంప్రదించవచ్చు. చిరునామా: 42 గాంధీ రోడ్, 500001 హైదరాబాద్."
  example_title: Clinical Note with PII (Telugu)
---
# OpenMed-PII-Telugu-BioClinicalBERT-110M-v1
**Telugu PII Detection Model** | 110M Parameters | Open Source
## Model Description
**OpenMed-PII-Telugu-BioClinicalBERT-110M-v1** is a transformer-based token classification model fine-tuned for **Personally Identifiable Information (PII) detection in Telugu text**. This model identifies and classifies **54 types of sensitive information** including names, addresses, social security numbers, medical record numbers, and more.
### Key Features
- **Telugu-Optimized**: Fine-tuned on the Telugu subset of the AI4Privacy dataset
- **High Accuracy**: Micro F1 of 0.8683 (precision 0.8887, recall 0.8489) on the Telugu test split
- **Comprehensive Coverage**: Detects 54 entity types spanning personal, financial, medical, and contact information
- **Privacy-Focused**: Designed for de-identification and compliance with GDPR and other privacy regulations
- **Production-Ready**: Loads through the standard Transformers `pipeline` API for use in text processing pipelines
## Performance
Evaluated on the test split of the Telugu subset of the AI4Privacy dataset:
| Metric | Score |
|:---|:---:|
| **Micro F1** | **0.8683** |
| Precision | 0.8887 |
| Recall | 0.8489 |
| Macro F1 | 0.8703 |
| Weighted F1 | 0.8658 |
| Accuracy | 0.9484 |
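
As a quick consistency check, the micro F1 should equal the harmonic mean of the reported precision and recall. The sketch below recomputes it from the two table values; the helper name `f1_from_pr` is illustrative and not part of any library.

```python
def f1_from_pr(precision: float, recall: float) -> float:
    """Return the harmonic mean of precision and recall."""
    if precision + recall == 0:
        return 0.0
    return 2 * precision * recall / (precision + recall)


# Precision and recall copied from the performance table.
micro_f1 = f1_from_pr(0.8887, 0.8489)
print(f"{micro_f1:.4f}")  # 0.8683
```

The result agrees with the reported Micro F1 to four decimal places.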
### Top 10 Telugu PII Models
| Rank | Model | F1 | Precision | Recall |
|:---:|:---|:---:|:---:|:---:|
| 1 | [OpenMed-PII-Telugu-SuperClinical-Large-434M-v1](https://huggingface.co/OpenMed/OpenMed-PII-Telugu-SuperClinical-Large-434M-v1) | 0.9525 | 0.9521 | 0.9528 |
| 2 | [OpenMed-PII-Telugu-SnowflakeMed-Large-568M-v1](https://huggingface.co/OpenMed/OpenMed-PII-Telugu-SnowflakeMed-Large-568M-v1) | 0.9507 | 0.9508 | 0.9507 |
| 3 | [OpenMed-PII-Telugu-BigMed-Large-560M-v1](https://huggingface.co/OpenMed/OpenMed-PII-Telugu-BigMed-Large-560M-v1) | 0.9505 | 0.9504 | 0.9507 |
| 4 | [OpenMed-PII-Telugu-SuperMedical-Large-355M-v1](https://huggingface.co/OpenMed/OpenMed-PII-Telugu-SuperMedical-Large-355M-v1) | 0.9494 | 0.9492 | 0.9495 |
| 5 | [OpenMed-PII-Telugu-ClinicalBGE-568M-v1](https://huggingface.co/OpenMed/OpenMed-PII-Telugu-ClinicalBGE-568M-v1) | 0.9485 | 0.9485 | 0.9485 |
| 6 | [OpenMed-PII-Telugu-mClinicalE5-Large-560M-v1](https://huggingface.co/OpenMed/OpenMed-PII-Telugu-mClinicalE5-Large-560M-v1) | 0.9474 | 0.9468 | 0.9480 |
| 7 | [OpenMed-PII-Telugu-NomicMed-Large-395M-v1](https://huggingface.co/OpenMed/OpenMed-PII-Telugu-NomicMed-Large-395M-v1) | 0.9417 | 0.9417 | 0.9416 |
| 8 | [OpenMed-PII-Telugu-SuperClinical-Base-184M-v1](https://huggingface.co/OpenMed/OpenMed-PII-Telugu-SuperClinical-Base-184M-v1) | 0.9414 | 0.9413 | 0.9416 |
| 9 | [OpenMed-PII-Telugu-mSuperClinical-Base-279M-v1](https://huggingface.co/OpenMed/OpenMed-PII-Telugu-mSuperClinical-Base-279M-v1) | 0.9414 | 0.9418 | 0.9410 |
| 10 | [OpenMed-PII-Telugu-ModernMed-Large-395M-v1](https://huggingface.co/OpenMed/OpenMed-PII-Telugu-ModernMed-Large-395M-v1) | 0.9361 | 0.9357 | 0.9365 |
## Supported Entity Types
This model detects **54 PII entity types** organized into categories:
**Identifiers (22 types)**
| Entity | Description |
|:---|:---|
| `ACCOUNTNAME` | Account name |
| `BANKACCOUNT` | Bank account number |
| `BIC` | Bank Identifier Code (BIC) |
| `BITCOINADDRESS` | Bitcoin address |
| `CREDITCARD` | Credit card number |
| `CREDITCARDISSUER` | Credit card issuer |
| `CVV` | Card verification value (CVV) |
| `ETHEREUMADDRESS` | Ethereum address |
| `IBAN` | International Bank Account Number (IBAN) |
| `IMEI` | International Mobile Equipment Identity (IMEI) |
| ... | *and 12 more* |
**Personal Info (11 types)**
| Entity | Description |
|:---|:---|
| `AGE` | Age |
| `DATEOFBIRTH` | Date of birth |
| `EYECOLOR` | Eye color |
| `FIRSTNAME` | First name |
| `GENDER` | Gender |
| `HEIGHT` | Height |
| `LASTNAME` | Last name |
| `MIDDLENAME` | Middle name |
| `OCCUPATION` | Occupation |
| `PREFIX` | Name prefix (title) |
| ... | *and 1 more* |
**Contact Info (2 types)**
| Entity | Description |
|:---|:---|
| `EMAIL` | Email |
| `PHONE` | Phone |
**Location (9 types)**
| Entity | Description |
|:---|:---|
| `BUILDINGNUMBER` | Building number |
| `CITY` | City |
| `COUNTY` | County |
| `GPSCOORDINATES` | GPS coordinates |
| `ORDINALDIRECTION` | Ordinal direction |
| `SECONDARYADDRESS` | Secondary address |
| `STATE` | State |
| `STREET` | Street |
| `ZIPCODE` | ZIP or postal code |
**Organization (3 types)**
| Entity | Description |
|:---|:---|
| `JOBDEPARTMENT` | Job department |
| `JOBTITLE` | Job title |
| `ORGANIZATION` | Organization |
**Financial (5 types)**
| Entity | Description |
|:---|:---|
| `AMOUNT` | Amount |
| `CURRENCY` | Currency |
| `CURRENCYCODE` | Currency code |
| `CURRENCYNAME` | Currency name |
| `CURRENCYSYMBOL` | Currency symbol |
**Temporal (2 types)**
| Entity | Description |
|:---|:---|
| `DATE` | Date |
| `TIME` | Time |
## Usage
### Quick Start
```python
from transformers import pipeline
# Load the PII detection pipeline
ner = pipeline("ner", model="OpenMed/OpenMed-PII-Telugu-BioClinicalBERT-110M-v1", aggregation_strategy="simple")
text = """
రోగి రాజేష్ కుమార్ (పుట్టిన తేదీ: 15/03/1985, ఆధార్: 9876 5432 1098) ను నేడు పరీక్షించారు.
సంప్రదింపు: rajesh.kumar@email.in, ఫోన్: +91 98765 43210.
చిరునామా: 123 విజయ రోడ్, 500034 హైదరాబాద్.
"""
entities = ner(text)
# Each aggregated entity carries its label, surface text, and confidence score
for entity in entities:
    print(f"{entity['entity_group']}: {entity['word']} (score: {entity['score']:.3f})")
```
### De-identification Example
```python
def redact_pii(text, entities, placeholder=None):
    """Replace detected PII spans with placeholders.

    By default each span becomes its entity label, e.g. [EMAIL].
    Pass `placeholder` to use one fixed string such as '[REDACTED]'.
    """
    # Replace from the end of the text backwards so earlier offsets stay valid
    sorted_entities = sorted(entities, key=lambda x: x['start'], reverse=True)
    redacted = text
    for ent in sorted_entities:
        replacement = placeholder or f"[{ent['entity_group']}]"
        redacted = redacted[:ent['start']] + replacement + redacted[ent['end']:]
    return redacted

# Apply de-identification to the Quick Start `text` and `entities`
redacted_text = redact_pii(text, entities)
print(redacted_text)
```
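
Some workflows only need certain categories removed, for example contact details but not names. The following is a minimal sketch of that variation, assuming entities shaped like the pipeline output (`entity_group`, `start`, `end`). The function `redact_selected` and the hand-written sample entities are hypothetical and exist only for illustration.

```python
def redact_selected(text, entities, types_to_redact):
    """Replace only the chosen entity types with [TYPE] placeholders."""
    # Work right-to-left so earlier offsets stay valid after each replacement.
    for ent in sorted(entities, key=lambda e: e["start"], reverse=True):
        if ent["entity_group"] in types_to_redact:
            label = f"[{ent['entity_group']}]"
            text = text[:ent["start"]] + label + text[ent["end"]:]
    return text


# Hand-written entities standing in for real pipeline output (assumption).
sample = "Contact Rajesh at rajesh@example.com today."
sample_entities = [
    {"entity_group": "FIRSTNAME", "start": 8, "end": 14},
    {"entity_group": "EMAIL", "start": 18, "end": 36},
]
print(redact_selected(sample, sample_entities, {"EMAIL"}))
# Contact Rajesh at [EMAIL] today.
```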
### Batch Processing
```python
from transformers import AutoModelForTokenClassification, AutoTokenizer
import torch
model_name = "OpenMed/OpenMed-PII-Telugu-BioClinicalBERT-110M-v1"
model = AutoModelForTokenClassification.from_pretrained(model_name)
tokenizer = AutoTokenizer.from_pretrained(model_name)
texts = [
"రోగి రాజేష్ కుమార్ (పుట్టిన తేదీ: 15/03/1985, ఆధార్: 9876 5432 1098) ను నేడు పరీక్షించారు.",
"సంప్రదింపు: rajesh.kumar@email.in, ఫోన్: +91 98765 43210.",
]
inputs = tokenizer(texts, return_tensors='pt', padding=True, truncation=True)
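# Inference only, so gradient tracking is disabled
with torch.no_grad():
    outputs = model(**inputs)
# Highest-scoring label id per token, shape (batch_size, sequence_length)
predictions = torch.argmax(outputs.logits, dim=-1)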
```
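
The batch example stops at integer label ids. One way to turn them into readable tags is sketched below in plain Python, using toy values in place of real model output. The helper `decode_predictions` and all `toy_*` names are illustrative assumptions; with the real model, the label mapping would come from `model.config.id2label`.

```python
def decode_predictions(tokens, pred_ids, attention_mask, id2label):
    """Pair each non-padding token with its predicted label string."""
    decoded = []
    for tok, pid, mask in zip(tokens, pred_ids, attention_mask):
        if mask == 0:  # padding position, carries no prediction of interest
            continue
        decoded.append((tok, id2label[pid]))
    return decoded


# Toy values standing in for one row of `inputs` and `predictions` (assumption).
toy_id2label = {0: "O", 1: "B-EMAIL", 2: "I-EMAIL"}
toy_tokens = ["[CLS]", "mail", ":", "a", "@", "b", "[SEP]", "[PAD]"]
toy_pred_ids = [0, 0, 0, 1, 2, 2, 0, 0]
toy_mask = [1, 1, 1, 1, 1, 1, 1, 0]

decoded = decode_predictions(toy_tokens, toy_pred_ids, toy_mask, toy_id2label)
for token, label in decoded:
    print(f"{token}\t{label}")
```

For row `i` of a real batch, the arguments would be `tokenizer.convert_ids_to_tokens(inputs["input_ids"][i].tolist())`, `predictions[i].tolist()`, `inputs["attention_mask"][i].tolist()`, and `model.config.id2label`.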
## Training Details
### Dataset
- **Source**: [AI4Privacy PII Masking 400k](https://huggingface.co/datasets/ai4privacy/pii-masking-400k) (Telugu subset)
- **Format**: BIO-tagged token classification
- **Labels**: 109 total (54 entity types × 2 BIO tags + O)
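
The label count follows from the BIO scheme: each entity type contributes a `B-` and an `I-` tag, plus one shared `O` tag. The sketch below builds such a label set for a small subset of the entity types; `build_bio_labels` is a hypothetical helper, and the ordering it produces is an assumption that need not match the ids stored in the model config.

```python
def build_bio_labels(entity_types):
    """Return ['O', 'B-X', 'I-X', ...] for the given entity types."""
    labels = ["O"]
    for name in sorted(entity_types):
        labels.append(f"B-{name}")
        labels.append(f"I-{name}")
    return labels


# Four of the entity types listed in this card, for illustration only.
subset = ["EMAIL", "PHONE", "DATE", "TIME"]
labels = build_bio_labels(subset)
label2id = {label: i for i, label in enumerate(labels)}

print(len(labels))   # 9, i.e. 2 * 4 + 1
print(2 * 54 + 1)    # 109, the total stated for the full label set
```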
### Training Configuration
- **Max Sequence Length**: 512 tokens
- **Epochs**: 3
- **Framework**: Hugging Face Transformers + Trainer API
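
This card does not say how word-level labels were mapped onto subword tokens during fine-tuning. A common approach for token classification with the Trainer API is to label only the first subword of each word and mask the rest with `-100`, the default ignore index of PyTorch's cross-entropy loss. The sketch below shows that approach under this assumption; `align_labels` is a hypothetical helper, and the `word_ids` list mimics what a fast tokenizer's `word_ids()` returns.

```python
def align_labels(word_ids, word_label_ids, ignore_index=-100):
    """Expand word-level label ids to one label id per subword token."""
    aligned = []
    previous = None
    for wid in word_ids:
        if wid is None:
            # Special tokens such as [CLS], [SEP], or padding
            aligned.append(ignore_index)
        elif wid != previous:
            # First subword of a word keeps the word's label
            aligned.append(word_label_ids[wid])
        else:
            # Remaining subwords are excluded from the loss
            aligned.append(ignore_index)
        previous = wid
    return aligned


# Toy input: three words split into 1, 2, and 3 subwords plus special tokens.
toy_word_ids = [None, 0, 1, 1, 2, 2, 2, None]
toy_word_labels = [0, 3, 4]
aligned = align_labels(toy_word_ids, toy_word_labels)
print(aligned)  # [-100, 0, 3, -100, 4, -100, -100, -100]
```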
## Intended Use & Limitations
### Intended Use
- **De-identification**: Automated redaction of PII in Telugu clinical notes, medical records, and documents
- **Compliance**: Supporting compliance with GDPR and other privacy regulations
- **Data Preprocessing**: Preparing datasets for research by removing sensitive information
- **Audit Support**: Identifying PII in document collections
### Limitations
**Important**: This model is intended as an **assistive tool**, not a replacement for human review.
- **False Negatives**: Some PII may go undetected; always verify the output in critical applications
- **Context Sensitivity**: Performance may vary with domain-specific terminology
- **Language**: Optimized for Telugu text; may not perform well on other languages
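
Because the model is meant to assist human review, one possible workflow is to apply high-confidence detections automatically and queue the rest for a reviewer. The sketch below splits pipeline-style entities by their `score`. The function `triage_entities`, the 0.90 threshold, and the sample entities are illustrative assumptions, not recommendations from the model authors.

```python
def triage_entities(entities, review_threshold=0.90):
    """Split entities into (auto_redact, needs_review) lists by score."""
    auto_redact, needs_review = [], []
    for ent in entities:
        if ent["score"] >= review_threshold:
            auto_redact.append(ent)
        else:
            needs_review.append(ent)
    return auto_redact, needs_review


# Hand-written entities standing in for real pipeline output (assumption).
sample_detections = [
    {"entity_group": "EMAIL", "word": "rajesh@example.com", "score": 0.99},
    {"entity_group": "CITY", "word": "Hyderabad", "score": 0.62},
]
auto_redact, needs_review = triage_entities(sample_detections)
print([e["entity_group"] for e in auto_redact])   # ['EMAIL']
print([e["entity_group"] for e in needs_review])  # ['CITY']
```

A threshold does nothing for PII the model misses entirely, so a reviewer still needs to read the full text in critical applications.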
## Citation
```bibtex
@misc{openmed-pii-2026,
title = {OpenMed-PII-Telugu-BioClinicalBERT-110M-v1: Telugu PII Detection Model},
author = {OpenMed Science},
year = {2026},
publisher = {Hugging Face},
url = {https://huggingface.co/OpenMed/OpenMed-PII-Telugu-BioClinicalBERT-110M-v1}
}
```
## Links
- **Organization**: [OpenMed](https://huggingface.co/OpenMed)