Findings (2)
- Heavy emphasis on multilingual coverage ('92 major official languages', '100+ languages and dialects'), which gestures toward linguistic breadth.
- In principle, the open-weight Apache 2.0 distribution permits community-led adaptation, including by Indigenous-language groups, without their needing to ask permission.
Gaps (4)
- No mention of Indigenous data sovereignty or the CARE Principles for Indigenous Data Governance.
- 'Dialects' are treated as coverage metrics, not as living relational knowledge systems with their own custodians.
- No consultation process, attribution, or benefit-sharing for the language communities whose text is scraped into training corpora.
- No acknowledgment of oral or other non-textual knowledge, or of the extractive nature of corpus collection.
Justification
Language breadth is framed purely as market and population reach ('95% of the global population'). There is no recognition of data sovereignty, custodianship, or consent. The framing is extractive by default: languages are treated as inputs, not as belonging to communities. The score is low, with two points given for the genuine multilingual reach and the open license, which at least make community reuse possible.