
How much of the internet is AI-generated now?

What share of newly published websites contain AI-generated or AI-assisted content, and what measurable changes does this cause across semantic diversity, sentiment, accuracy, and style?

Note · 2026-04-18 · sourced from Social Theory Society

An analysis of a representative sample of websites from the Internet Archive (2022-2025), scored with a state-of-the-art AI text detector, finds that "roughly 35% of newly published websites were classified as AI-generated or AI-assisted" by mid-2025, up from zero before ChatGPT's launch in late 2022. This is the first large-scale empirical baseline for a phenomenon previously discussed only through anecdote and speculation (the "Dead Internet Theory").
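The headline share is a per-period proportion of detector verdicts. A minimal sketch of that tally follows, assuming each sampled site has already been reduced to a (period, verdict) pair. The function name, the period labels, and the toy records are all illustrative assumptions; the note does not describe the study's actual pipeline or detector.

```python
from collections import defaultdict


def ai_share_by_period(records):
    """Return {period: fraction of sites flagged as AI-generated or AI-assisted}.

    `records` is an iterable of (period, is_ai) pairs. `period` is any sortable
    label such as "2025-H1"; `is_ai` is the detector's boolean verdict.
    Both are hypothetical inputs, not the study's real data format.
    """
    flagged = defaultdict(int)
    total = defaultdict(int)
    for period, is_ai in records:
        total[period] += 1
        if is_ai:
            flagged[period] += 1
    # Every period in `total` has at least one record, so no division by zero.
    return {period: flagged[period] / total[period] for period in sorted(total)}


# Made-up toy records for illustration only; these are not figures from the study.
toy_records = [
    ("2022-H1", False), ("2022-H1", False),
    ("2025-H1", True), ("2025-H1", False), ("2025-H1", False), ("2025-H1", True),
]
shares = ai_share_by_period(toy_records)
```

Any real estimate also inherits the detector's false-positive and false-negative rates, which this tally does not correct for.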

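What the data shows: semantic diversity declines and positive sentiment rises, while factual accuracy and stylistic diversity are unaffected.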

The perception gap. A user study found that the majority of US adults believe all four hypothesized effects (reduced semantic diversity, increased positive sentiment, decreased factual accuracy, decreased stylistic diversity). People who do not use AI, or who use it infrequently, believe more strongly in the negative impacts, as do those who hold negative views of AI. The perception of harm exceeds the measured harm on two of the four dimensions (factual accuracy and stylistic diversity) but is validated on the other two (semantic diversity and positive sentiment). Public fear is neither paranoia nor prophecy; it is half right.

The semantic diversity finding is the key result. Stylistic diversity is preserved (the words vary), but semantic diversity declines. This mirrors the pattern from "Why do different LLMs generate nearly identical outputs?": surface variation masks idea convergence. The internet is saying the same things in different ways.
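The split between surface variation and idea convergence can be made concrete with two pairwise-distance measures. The sketch below is one possible operationalization, not the study's metric (the note does not say how either kind of diversity was measured): lexical diversity as mean pairwise Jaccard distance over word sets, and semantic diversity as mean pairwise cosine distance over embedding vectors. The embeddings here are hand-written toy vectors standing in for the output of some sentence-embedding model; they are assumptions for illustration.

```python
import math
from itertools import combinations


def lexical_diversity(texts):
    """Mean pairwise Jaccard distance between the word sets of the texts."""
    if len(texts) < 2:
        raise ValueError("need at least two texts")
    word_sets = [set(text.lower().split()) for text in texts]
    pairs = list(combinations(word_sets, 2))
    return sum(1 - len(a & b) / len(a | b) for a, b in pairs) / len(pairs)


def cosine_distance(u, v):
    """1 minus the cosine similarity of two non-zero vectors."""
    dot = sum(x * y for x, y in zip(u, v))
    norm_u = math.sqrt(sum(x * x for x in u))
    norm_v = math.sqrt(sum(y * y for y in v))
    return 1 - dot / (norm_u * norm_v)


def semantic_diversity(embeddings):
    """Mean pairwise cosine distance between embedding vectors."""
    if len(embeddings) < 2:
        raise ValueError("need at least two embeddings")
    pairs = list(combinations(embeddings, 2))
    return sum(cosine_distance(u, v) for u, v in pairs) / len(pairs)


# Three sentences that share no words but make the same claim.
texts = [
    "remote work boosts productivity",
    "telecommuting raises employee output",
    "working from home improves efficiency",
]
# Hypothetical embeddings: nearly parallel because the meaning is the same.
toy_embeddings = [[1.0, 0.0], [0.98, 0.05], [0.97, -0.04]]

lexical = lexical_diversity(texts)
semantic = semantic_diversity(toy_embeddings)
```

On this toy input the lexical score is at its maximum while the semantic score is close to zero, which is the "same things in different ways" pattern in miniature.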

Connection to model collapse. In light of "Does training on AI-generated content permanently degrade model quality?", the 35% AI content baseline establishes the starting condition for recursive degradation. If future models train on web crawls that are already roughly one-third AI-generated, the loss of the distribution's tails accelerates. The semantic diversity decline measured here may be the early empirical signal of model collapse manifesting in the wild, not in lab experiments.

The positive sentiment bias confirms what the homogeneity research predicts: AI output defaults to agreeable, constructive, and upbeat framing. Read alongside "Does AI homogenize culture the way mass media did?", the sentiment shift represents the AI culture industry's affective signature: systematically positive, systematically inoffensive, systematically unremarkable.

The factual accuracy non-finding is surprising given hallucination concerns but may reflect selection effects: AI-generated websites that contain obvious factual errors may be less likely to persist in the archive, or factual accuracy may be domain-dependent in ways the aggregate measure misses.


Source: Social Theory Society Paper: The Impact of AI-Generated Text on the Internet

Original note title

35 percent of new websites are AI-generated by mid-2025 — semantic diversity declines and positive sentiment rises but factual accuracy and stylistic diversity are unaffected