| Field: | Response: |
|---|---|
| Participation considerations from adversely impacted groups (protected classes) in model design and testing: | None |
| Measures taken to mitigate against unwanted bias: | None |

Representational Bias and Fairness (Bias Subcard)

These datasets, such as C4, Dolma, and FineWeb4, do not exhaustively represent all demographic groups, nor do they represent those groups proportionally. For instance, over 83% of C4 samples contained no mention of age, and over 98% of FineWeb4 samples contained no reference to disability. The datasets also show representational skews: references to "male" significantly outnumber references to "female," and "White" is the most frequently mentioned ethnic identifier. To address these skews, we recommend evaluating the model for bias and applying fine-tuning or other mitigation techniques to align it with the desired model behavior. This evaluation used a 3,000-sample subset of each dataset, a size identified as the optimal threshold for maximizing embedder accuracy, and it includes outputs from uncalibrated embedders; as a result, the reliability of the embeddings may be limited.
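The kind of statistic quoted above (share of samples with no mention of an attribute, and relative frequency of identifiers) can be approximated with a simple count over a random subset. The sketch below is illustrative only: it swaps in plain keyword matching for the embedder-based analysis this card describes, and the names `PATTERNS`, `sample_subset`, and `mention_stats`, along with the keyword lists themselves, are hypothetical and not part of the original evaluation.

```python
import random
import re

# Hypothetical keyword patterns for illustration. The card's analysis used
# embedders, not keyword matching, so results would not be directly comparable.
PATTERNS = {
    "male": re.compile(r"\b(male|man|men)\b", re.IGNORECASE),
    "female": re.compile(r"\b(female|woman|women)\b", re.IGNORECASE),
}


def sample_subset(texts, k=3000, seed=0):
    """Draw up to k texts uniformly at random, without replacement."""
    texts = list(texts)
    if len(texts) <= k:
        return texts
    return random.Random(seed).sample(texts, k)


def mention_stats(texts, patterns=PATTERNS):
    """Return the share of texts mentioning each label, plus the share
    of texts that match none of the labels ("no_mention")."""
    n = len(texts)
    if n == 0:
        raise ValueError("mention_stats needs at least one text")
    hits = {label: 0 for label in patterns}
    no_mention = 0
    for text in texts:
        matched = [label for label, pat in patterns.items() if pat.search(text)]
        for label in matched:
            hits[label] += 1
        if not matched:
            no_mention += 1
    stats = {label: count / n for label, count in hits.items()}
    stats["no_mention"] = no_mention / n
    return stats
```

Calling `mention_stats(sample_subset(corpus))` on an iterable of document strings would give one dictionary per dataset, which can then be compared across datasets. Keyword lists of this sort miss paraphrases and context, which is one reason an embedder-based approach may be preferred in practice.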