Findings (3)
- BigScience's multilingual scope materially widened representation beyond the usual high-resource set: BLOOM covers 46 natural languages, including under-resourced languages such as Wolof, Lingala, and Setswana, along with several other African and Indian languages. This indirectly serves communities whose languages are normally excluded.
- ROOTS corpus governance was deliberate: the BigScience Data working group made explicit sourcing decisions and documented data provenance, a precondition for any sovereignty conversation.
- Community-led language selection: language communities and native-speaker researchers were involved in deciding which languages entered BLOOM, rather than languages being chosen by data availability alone.
Gaps (4)
- No explicit adoption of the CARE Principles for Indigenous Data Governance (Collective Benefit, Authority to Control, Responsibility, Ethics).
- No evidence of formal Indigenous data sovereignty protocols or consent frameworks for the communities whose language data entered ROOTS.
- Oral, non-textual, and relational knowledge is structurally absent: BLOOM is trained on a text corpus, so embodied and oral traditions are excluded by construction.
- Multilingual breadth is not the same as data sovereignty; inclusion of a language does not grant the speaking community authority over its use.
Justification
BigScience earns more than a floor score because its multilingual, community-involved language selection is a genuine and rare gesture toward communities that are usually excluded. However, inclusion is treated as the achievement in itself; sovereignty (authority to control downstream use, CARE adoption, consent) is not addressed. The score reflects real inclusivity without the governance layer that would make it preservative.