Findings (2)
- The Aya initiative explicitly targets 'more than 50 previously underserved languages' and supports 'low-resource languages', which structurally creates space for language communities typically excluded from NLP, some of which carry Indigenous and oral knowledge.
- Region-tuned variants (Tiny Aya Earth for African and West Asian languages, Tiny Aya Fire for South Asian languages) acknowledge that linguistic geography is uneven and that one model does not fit all communities.
Gaps (4)
- No mention of the CARE Principles for Indigenous Data Governance (Collective benefit, Authority to control, Responsibility, Ethics) anywhere in the visible text.
- No reference to Indigenous data sovereignty, community ownership of language corpora, or the right of a language community to withdraw its data.
- Oral traditions and non-textual knowledge are not addressed; the framing centers on text and instruction-tuning corpora, which privileges written language and can be extractive toward oral-first cultures.
- The 'underserved' framing positions communities as recipients of AI rather than as authorities over their own knowledge.
Justification
The score sits above the typical floor because the entire Aya program is oriented toward languages that mainstream labs ignore, and a base of 3,000+ distributed contributors is materially more participatory than a Western core-lab model. But low-resource-language inclusion is not the same as Indigenous data sovereignty: the text contains no CARE language and no community-authority or withdrawal mechanism, and it has an oral-knowledge blind spot. A 4 reflects genuine structural openness without the governance scaffolding that would make it non-extractive.