LAION

Germany (global) · laion.ai · AI stance published ↗ · open
open datasets · open models · research infrastructure

Non-profit; created LAION-5B dataset; strong open-science advocacy.

PALS scores

Preservative dimensions

PALS composite
4.0
Mean of three dimensions, 1–10.
Completeness
6.0
Sources, limits, transparency.
Multiplicity
4.0
Epistemologies, languages, voices.
Responsibility
2.0
Accountability, refusal, governance.
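As a quick check on the composite, here is a minimal sketch that assumes it is the unweighted arithmetic mean of the three dimension scores shown above. The function name `pals_composite` is illustrative and not part of StanceWatch.

```python
from statistics import mean

# Dimension scores as published on this page (1-10 scale).
dimensions = {"completeness": 6.0, "multiplicity": 4.0, "responsibility": 2.0}


def pals_composite(scores: dict[str, float]) -> float:
    """Unweighted mean of the dimension scores, rounded to one decimal.

    Assumption: equal weighting, as implied by "Mean of three dimensions".
    """
    return round(mean(scores.values()), 1)


print(pals_composite(dimensions))  # 4.0
```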
Eight lenses

What's missing, by lens

Each lens carries a canonical question and corrects a specific epistemic failure. Each entry gives the score, findings, gaps, and a justification.

Lens 01
Indigenous Knowledge
Whose knowledge is missing?
1/10
Findings (2)
  • No reference to Indigenous data sovereignty, the CARE Principles, or any community-controlled data governance anywhere in the homepage or LAION-5B documentation.
  • The operative model is the inverse of relational stewardship: bulk extraction of 5.85 billion image-text pairs scraped from the open web with no provenance, consent, or community-of-origin tracking.
Gaps (3)
  • No CARE (Collective benefit, Authority to control, Responsibility, Ethics) framing to sit alongside the FAIR/openness framing.
  • No mechanism by which a community whose images, ceremonies, or sacred imagery were scraped could assert authority, withdraw material, or even be notified.
  • Oral, embodied, and non-textual knowledge is structurally invisible to an image-text-pair pipeline; nothing acknowledges this absence.
Justification

Openness here is the opposite pole from preservative care. A consentless mass-scrape of the global web maximally exposes Indigenous and community-held imagery to extraction while offering zero sovereignty mechanism. Lowest score.

Lens 02
Deep History
What historical process produced this?
3/10
Findings (2)
  • Some historical self-positioning: LAION frames itself against proprietary, resource-duplicating AI, advocating 'reusing existing datasets and models' for environmental reasons.
  • Implicit awareness of the GPU/compute geopolitics that motivate 'liberating' ML research from well-capitalised incumbents.
Gaps (3)
  • No acknowledgment that web-scraping inherits a long colonial-extractive logic — taking a commons without reciprocity and re-enclosing the value.
  • No discussion of the invisible labour history behind 'open' data: the uncompensated creators whose images form the corpus, or downstream Global-South content moderators.
  • Regulatory and legal history (EU copyright, the German legal context, later CSAM findings) is absent from the stance documents themselves.
Justification

There is a thin historical narrative — democratisation versus enclosure — but it is self-flattering and omits the extractive lineage of mass scraping and the labour history embedded in the corpus. Below midpoint.

Lens 03
Cross-Cultural Wisdom
Which perspectives have been flattened?
4/10
Findings (2)
  • Genuine multilingual breadth at the data level: 2.2B multilingual samples spanning 100+ languages, with named distributions (Russian 241M, French 168M, German 150M, Spanish 149M, Chinese 143M, Japanese 131M).
  • Framing positions multilingual coverage as 'global research accessibility' and 'democratising' access across language communities.
Gaps (3)
  • Breadth is quantitative, not relational: no cultural scholars, no community translators, no preservation of culturally specific reasoning — just language-tagged pairs harvested at scale.
  • The heavy skew toward a handful of European and East-Asian high-resource languages, with 1.2B 'non-assignable' samples, reproduces existing web inequities rather than correcting them.
  • CLIP filtering itself encodes an English/Western visual-semantic prior: non-Western imagery is scored by a model whose image-text alignment was learned from Western-skewed data, which flattens it without comment.
Justification

Real multilingual quantity earns more than token presence, but it is coverage-as-counting, not cross-cultural wisdom. The CLIP prior and unexamined language skew flatten precisely what the lens asks about. Mid-low.
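The skew can be put in numbers using only the figures quoted in the findings above. This is a rough sketch: the 2.2B multilingual total is a rounded figure, so the share is approximate.

```python
# Figures quoted in the findings above, in millions of samples.
multilingual_total = 2_200
named = {
    "Russian": 241, "French": 168, "German": 150,
    "Spanish": 149, "Chinese": 143, "Japanese": 131,
}

named_total = sum(named.values())
top6_share = named_total / multilingual_total

print(f"Top six languages: {named_total}M of {multilingual_total}M "
      f"({top6_share:.1%})")
# Top six languages: 982M of 2200M (44.6%)
```

On these figures, six of the 100+ languages account for a little under half of the multilingual subset.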

Lens 04
Scientific Evidence
What does the evidence show, and what are its limits?
6/10
Findings (3)
  • Strongest lens by design: open weights, open metadata (CC-BY 4.0), open tooling (img2dataset, clip-retrieval) and published safety classifiers enable genuine third-party replication and verification.
  • Explicit, quantified limitation disclosure: NSFW classifier reported at 96% accuracy on a balanced test set; watermark detector described; uncurated-content warning issued.
  • Honest deployment caveat: recommends against industrial product use 'until foundational safety research matures.'
Gaps (3)
  • No independent third-party audit is cited in the stance documents; the safety claims are self-reported by the dataset's own authors.
  • The well-documented later finding of CSAM material within LAION-5B (Stanford Internet Observatory, leading to the dataset's takedown and re-release as Re-LAION) exposes that the self-reported NSFW filtering materially failed — a falsification the original stance page does not and cannot acknowledge.
  • '96% accuracy' is presented without the precision/recall or base-rate context that matters at billion-sample scale, where a 4% error rate across 5.85 billion items would mean roughly 234 million misclassified items.
Justification

Open weights and quantified, caveated claims are real scientific virtues and the lab's clear strength. But the absence of independent audit, the base-rate-naive accuracy figure, and the subsequent CSAM finding that falsified the safety story cap this well short of high. Upper-middle, not more.
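The base-rate point can be made concrete with a short worked sketch. It makes two loudly labelled assumptions that are not in the lab's documentation: that the classifier's sensitivity and specificity both equal the reported 96%, and a hypothetical 1% prevalence of NSFW items in the live corpus.

```python
TOTAL_ITEMS = 5_850_000_000   # LAION-5B size quoted in this audit
ACCURACY = 0.96               # self-reported, on a balanced test set

# If the 4% error rate carried over unchanged to the full corpus:
misclassified = (1 - ACCURACY) * TOTAL_ITEMS
print(f"{misclassified:,.0f} misclassified items")  # 234,000,000 misclassified items


def precision(prevalence: float,
              sensitivity: float = 0.96,
              specificity: float = 0.96) -> float:
    """Share of flagged items that are truly positive (Bayes' rule).

    Assumption: sensitivity == specificity == reported accuracy.
    """
    true_pos = prevalence * sensitivity
    false_pos = (1 - prevalence) * (1 - specificity)
    return true_pos / (true_pos + false_pos)


print(f"{precision(0.50):.2f}")  # 0.96 on a balanced test set
print(f"{precision(0.01):.2f}")  # 0.20 at a hypothetical 1% prevalence
```

Under these assumptions the same classifier that looks 96% reliable on a balanced benchmark would be right about only one flag in five on a corpus where positives are rare, which is why the headline figure alone says little about real-world filtering.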

Lens 05
Artistic Perception
What does this feel like, not just mean?
2/10
Findings (2)
  • The dataset is foundational to machine artistic production: it is the training base for the Stable Diffusion lineage and for open models in the style of CLIP, DALL-E, GLIDE, and BLIP (the original CLIP and DALL-E were trained on OpenAI's own data).
  • A single affective register appears: the warning that links may lead to 'strongly discomforting and disturbing content' acknowledges that data has emotional weight.
Gaps (3)
  • No acknowledgment of the affective or moral injury to the human artists whose work was scraped without consent into the corpus that now competes with them.
  • No space for ambiguity or poetic uncertainty; imagery is treated as pairs to be filtered for efficiency, not as expression with feeling or authorship.
  • The emotional labour of moderators and of people whose traumatic images sit in the set is unrecognised beyond a one-line content warning.
Justification

An image dataset that fuels generative art yet treats every image as an interchangeable pair, and never reckons with authorship, feeling, or harm to depicted persons, scores low on perceiving what the data feels like rather than means.

Lens 06
Future Modelling
Where is this heading, and for whom?
3/10
Findings (2)
  • Environmental future is gestured at via the 'reuse existing datasets and models' efficiency argument.
  • A precautionary stance toward deployment futures: explicit recommendation against industrial use pending safety research maturity.
Gaps (3)
  • No engagement with labour-displacement futures for the creative workers whose scraped output trains the displacing models.
  • No quantified environmental disclosure — only a qualitative reuse claim, no compute or carbon figures.
  • No democratic or participatory governance of the systems built on the data; 'release and recommend caution' offloads downstream futures onto unaccountable adopters.
Justification

There is a precautionary gesture and a reuse-efficiency claim, but whose futures are being shaped (artists, depicted persons, the Global South) is unaddressed, and the governance model is release-and-disclaim. Below midpoint.

Lens 07
Marginalised Voices
Who is not at the table?
2/10
Findings (2)
  • Open access nominally lowers barriers for under-resourced researchers worldwide via free tooling (clip-retrieval, img2dataset) and free weights.
  • Community engagement is real but narrow: Discord and GitHub for 'everyone who is kind and passionate about machine learning.'
Gaps (3)
  • The people most affected — those whose private, medical, traumatic, or (as later found) child-sexual-abuse images were scraped — have no representation, consent, redress, or compensated channel.
  • No participatory design with Global-South developers, no disability/accessibility commitments, no labour-representative engagement.
  • 'Kind and passionate about machine learning' self-selects an already-technical in-group; the marginalised are objects of the dataset, never partners in governing it.
Justification

Removing a download barrier is not the same as giving people a voice. The most marginalised here are the non-consenting depicted, who have no seat at the table and were later shown to include abuse victims. Near the floor.

Lens 08
Trickster Knowledge
What truth appears when the story is inverted?
4/10
Findings (2)
  • The lab does hold one genuine contradiction in plain sight: it ships a sweeping safety/openness mission while simultaneously admitting its flagship artefact is 'uncurated' and may surface 'disturbing content' — an honesty most polished labs would smooth away.
  • Invert the slogan and the audit writes itself: 'TRULY OPEN AI. 100% NON-PROFIT. 100% FREE' read against the depicted-but-unconsenting subjects becomes 'truly open about other people's images, free for the scraper, costed to the scraped.'
Gaps (3)
  • The lab never turns irony on itself: it does not question its founding premise that openness equals virtue, the very premise the CSAM finding most needed it to question.
  • '100% FREE' is an unexamined boast: free to whom, at whose expense? The all-caps certainty is the opposite of trickster self-suspicion.
  • No structural space where the open-data narrative is allowed to be tested by its own opposite (preservative refusal, the right not to be collected).
Justification

One honest contradiction is admitted (the uncurated warning), which is more than many labs offer, but the all-caps certitude of the mission is never itself inverted or audited. Mid-low: candour about content, none about creed.

Suffixscape

Linguistic diagnostics

Regex- and LLM-detected patterns of evasion in the lab's own prose: nominalised evasion, agency diffusion, epistemic inflation, temporal flatness. Distinct from the CognioNews -scape editorial format — see methodology.

Pattern: epistemic inflation
Quote: "TRULY OPEN AI. 100% NON-PROFIT. 100% FREE"
Effect: All-caps absolute superlatives ('TRULY', '100%') manufacture certainty and moral closure, foreclosing the question of who bears the uncosted externalities (non-consenting depicted persons, scraped artists). 'Free' for the user reads as free of cost, hiding that the cost was simply transferred to those collected.
Preservative alternative: "Openly licensed and non-profit. Free to download, but built from web images whose subjects did not consent, a cost we are still accounting for."

Pattern: agency diffusion
Quote: "collected links may lead to strongly discomforting and disturbing content"
Effect: Passive/inanimate framing ('collected links may lead') erases the actor who collected them and the decision to release them uncurated. The links acquire agency; LAION recedes from responsibility for the harm.
Preservative alternative: "We scraped these links indiscriminately and chose to release them with only automated filtering; as a result we are knowingly distributing pointers to disturbing and, we later learned, illegal content."

Pattern: nominalised evasion
Quote: "The motivation behind dataset creation is to democratize research and experimentation"
Effect: 'Dataset creation' nominalises the act of mass web-scraping, hiding the actor, the consent question, and the extraction. 'Democratize' borrows political legitimacy for what is technically a non-consensual harvest.
Preservative alternative: "We scraped billions of public web images to lower the cost of large-model research for less-resourced labs, without seeking consent from the people and creators whose images we took."

Pattern: epistemic inflation
Quote: "A CLIP-based classifier achieving 96% accuracy on balanced test sets"
Effect: A clean superlative-like figure ('96% accuracy on balanced test sets') implies the safety problem is largely solved, while concealing that on a 5.85-billion-item real-world (not balanced) corpus a 4% error rate is hundreds of millions of items, the gap through which CSAM later entered.
Preservative alternative: "Our NSFW classifier reaches 96% on a balanced benchmark; at billion-scale on the live web this still lets through millions of mis-filtered items, and independent audits later found illegal content we missed."
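The regex half of these diagnostics could look something like the sketch below. The patterns are hypothetical illustrations written for this page, not StanceWatch's actual rule set, and they cover only three of the quoted examples. A figure such as "96% accuracy" would need the LLM pass or a richer rule.

```python
import re

# Hypothetical patterns for illustration only; not the auditor's real rules.
PATTERNS = {
    # All-caps absolutes and "100%" claims.
    "epistemic inflation": re.compile(r"\b(?:TRULY\b|100\s?%)"),
    # An inanimate noun as the grammatical subject of a modal verb.
    "agency diffusion": re.compile(
        r"\b(?:links?|data|content|models?)\s+(?:may|might|can|could)\b"
    ),
    # A noun followed by an action nominalisation that hides the actor.
    "nominalised evasion": re.compile(
        r"\b\w+\s+(?:creation|collection|generation|utili[sz]ation)\b"
    ),
}


def flag(text: str) -> list[str]:
    """Return the names of every pattern that matches the text."""
    return [name for name, rx in PATTERNS.items() if rx.search(text)]


print(flag("TRULY OPEN AI. 100% NON-PROFIT. 100% FREE"))
print(flag("collected links may lead to strongly discomforting and disturbing content"))
print(flag("The motivation behind dataset creation is to democratize research and experimentation"))
```

Patterns this crude will over- and under-match on real prose, which is presumably why the methodology pairs regex detection with an LLM pass.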
Audit history

Prior audits

Latest audit: 2026-06-08 · sources: https://laion.ai, https://laion.ai/blog/laion-5b/

Transparency

Raw data

Every audit is published as machine-readable JSON. You can read this lab's latest report at /stancewatch/api/labs/laion.json — it carries the per-lens findings, evidence quotes, Suffixscape flags, PALS scores, the sources actually read, and a confidence note.
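A minimal sketch of working with such a report follows. The published schema is not shown on this page, so the key names (`lab`, `lenses`, `id`, `score`) and the first six lens ids are assumptions; only `scientific_evidence` and `marginalised_voices` appear in the confidence note. The payload is an inline sample built from the lens scores above, not a fetched response.

```python
import json

# Hypothetical payload: key names are assumptions, not the published schema.
# Scores are the per-lens scores shown on this page.
sample = """
{
  "lab": "laion",
  "lenses": [
    {"id": "indigenous_knowledge", "score": 1},
    {"id": "deep_history", "score": 3},
    {"id": "cross_cultural_wisdom", "score": 4},
    {"id": "scientific_evidence", "score": 6},
    {"id": "artistic_perception", "score": 2},
    {"id": "future_modelling", "score": 3},
    {"id": "marginalised_voices", "score": 2},
    {"id": "trickster_knowledge", "score": 4}
  ]
}
"""

report = json.loads(sample)

# sorted() is stable, so tied scores keep their order of appearance.
lowest = sorted(report["lenses"], key=lambda lens: lens["score"])[:3]
for lens in lowest:
    print(f'{lens["id"]}: {lens["score"]}/10')
```

To read the live report, the same parsing would apply to the body returned from /stancewatch/api/labs/laion.json once the real field names are confirmed.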

Found an error, or a stance page we missed? We audit public communications only — point us to the page and the next audit will read it. Write to hello@cognioengine.co.uk.

Audit date: 2026-06-08

Moderate-to-good confidence. Both target URLs (homepage and the LAION-5B stance page) were successfully fetched and quoted directly, so lens findings rest on primary text rather than public memory. The original stance page predates the 2023 Stanford Internet Observatory CSAM finding and the Re-LAION re-release; those well-documented controversies are weighed from public knowledge and were not on the audited pages, so scores for scientific_evidence, marginalised_voices and responsibility are deliberately held low to reflect what the public record (not the lab's self-presentation) shows. Qualitative judgment; not a validated metric.

Auditor: GoldBerry v1.3 / StanceWatch v1.0