| Field | Response |
| :--- | :--- |
| Generatable or reverse engineerable personally-identifiable information? | None |
| Was consent obtained for any personal data used? | None Known |
| Personal data used to create this model? | None Known |
| How often is dataset reviewed? | Before Release |
| Is there provenance for all datasets used in training? | Yes |
| Does data labeling (annotation, metadata) comply with privacy laws? | Yes |
| Applicable NVIDIA Privacy Policy | https://www.nvidia.com/en-us/about-nvidia/privacy-policy/ |

Privacy (Privacy Subcard)

During pre-training, we use automated tools and data processing techniques to scan datasets for text segments that match patterns associated with potential Personally Identifiable Information (PII). These scans identify and filter certain categories of personal information, including phone numbers, email addresses, and government IDs. The evaluation used a 3,000-sample subset per dataset, identified as the optimal threshold for maximizing embedder accuracy. Scans of the Dolma, Buzz-V1.2, and FineWeb4 sample sets detected no PII. Microsoft Presidio flagged potential privacy risks, such as items in HelpSteer 3, which on review proved to be non-sensitive placeholder data in code. Verified instances of PII, including an email address and a phone number found in the C4 dataset, were removed using automated filtering techniques, human-in-the-loop review, and redaction pipelines.
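As a rough illustration of pattern-based PII scanning on a fixed-size sample, the sketch below uses only the Python standard library. It is not the pipeline used for this model: the regular expressions, the function names (`sample_documents`, `scan_for_pii`, `redact`), and the placeholder format are all assumptions made for the example, and a production scanner such as Microsoft Presidio relies on much richer recognizers.

```python
import random
import re

# Hypothetical, deliberately simple patterns for two PII categories.
# Real scanners cover more categories (e.g. government IDs) and more formats.
PII_PATTERNS = {
    "EMAIL_ADDRESS": re.compile(
        r"\b[A-Za-z0-9._%+-]+@[A-Za-z0-9.-]+\.[A-Za-z]{2,}\b"
    ),
    "PHONE_NUMBER": re.compile(
        r"(?<!\d)(?:\+?1[-. ]?)?\(?\d{3}\)?[-. ]?\d{3}[-. ]?\d{4}(?!\d)"
    ),
}


def sample_documents(documents, sample_size=3000, seed=0):
    """Return a reproducible random sample of at most sample_size documents."""
    docs = list(documents)
    if len(docs) <= sample_size:
        return docs
    return random.Random(seed).sample(docs, sample_size)


def scan_for_pii(documents):
    """Return (document index, category, matched text) for every pattern hit."""
    findings = []
    for index, text in enumerate(documents):
        for label, pattern in PII_PATTERNS.items():
            for match in pattern.finditer(text):
                findings.append((index, label, match.group(0)))
    return findings


def redact(text, placeholder="<{label}>"):
    """Replace every pattern hit in text with a category placeholder."""
    for label, pattern in PII_PATTERNS.items():
        text = pattern.sub(placeholder.format(label=label), text)
    return text
```

A pattern match is only a candidate. Placeholder values in code, such as `user@example.com`, would be flagged by this sketch as well, which is why flagged items still need review before they are treated as verified PII.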