| Field: | Response: |
|---|---|
| Generatable or reverse-engineerable personally identifiable information? | None |
| Was consent obtained for any personal data used? | None Known |
| Personal data used to create this model? | None Known |
| How often is dataset reviewed? | Before Release |
| Is there provenance for all datasets used in training? | Yes |
| Does data labeling (annotation, metadata) comply with privacy laws? | Yes |
| Applicable NVIDIA Privacy Policy | https://www.nvidia.com/en-us/about-nvidia/privacy-policy/ |

Privacy (Privacy Subcard)

During pre-training, we use automated tools and data processing techniques to scan datasets for text segments that match patterns associated with potential Personally Identifiable Information (PII). These scans identify and filter certain categories of personal information, including phone numbers, email addresses, and government IDs. Each dataset was evaluated on a 3,000-sample subset, identified as the optimal threshold for maximizing embedder accuracy. Scans of the Dolma, Buzz-V1.2, and FineWeb4 samples detected no PII. Microsoft Presidio flagged potential privacy risks, including items in HelpSteer 3 that were ultimately found to be non-sensitive placeholder data in code. Verified instances of PII, including an email address and a phone number found in the C4 dataset, were removed using automated filtering techniques, human-in-the-loop review, and redaction pipelines.
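
As an illustration of pattern-based scanning on a fixed-size sample, the following is a minimal sketch and not the pipeline used for this model. It assumes plain-text records, uses only the Python standard library rather than Microsoft Presidio, and covers only email addresses and US-style phone numbers. The function names (`sample_records`, `find_pii`, `redact`) and the regular expressions are hypothetical and far simpler than a production detector.

```python
import random
import re

# Illustrative patterns only (assumptions, not the patterns used for this model).
# Real detectors combine many rules, checksums, and context-aware models.
PATTERNS = {
    "EMAIL": re.compile(r"\b[A-Za-z0-9._%+-]+@[A-Za-z0-9.-]+\.[A-Za-z]{2,}\b"),
    "PHONE": re.compile(
        r"(?<!\d)(?:\+?1[\s.-]?)?\(?\d{3}\)?[\s.-]?\d{3}[\s.-]?\d{4}(?!\d)"
    ),
}


def sample_records(records, k=3000, seed=0):
    """Return up to k records drawn uniformly at random (seeded for repeatability)."""
    records = list(records)
    if len(records) <= k:
        return records
    return random.Random(seed).sample(records, k)


def find_pii(text):
    """Return (label, start, end) tuples for every pattern match, ordered by position."""
    hits = []
    for label, pattern in PATTERNS.items():
        for match in pattern.finditer(text):
            hits.append((label, match.start(), match.end()))
    return sorted(hits, key=lambda hit: hit[1])


def redact(text):
    """Replace each detected span with a placeholder such as <EMAIL>."""
    pieces = []
    cursor = 0
    for label, start, end in find_pii(text):
        if start < cursor:
            continue  # skip a match that overlaps one already redacted
        pieces.append(text[cursor:start])
        pieces.append(f"<{label}>")
        cursor = end
    pieces.append(text[cursor:])
    return "".join(pieces)


def scan_dataset(records, k=3000):
    """Scan a random sample and return only the records with at least one hit."""
    flagged = []
    for record in sample_records(records, k):
        hits = find_pii(record)
        if hits:
            flagged.append((record, hits))
    return flagged
```

In a workflow like the one described above, the flagged records would then go to human review, since pattern matches can be false positives such as placeholder values in code.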