How much of the internet is AI-generated now?
What share of newly published websites contain AI-generated or AI-assisted content, and what measurable changes does this cause across semantic diversity, sentiment, accuracy, and style?
A study applying a state-of-the-art AI text detector to a representative sample of websites from the Internet Archive (2022-2025) finds that "roughly 35% of newly published websites were classified as AI-generated or AI-assisted" by mid-2025, up from effectively zero before ChatGPT's launch in late 2022. This is the first large-scale empirical baseline for a phenomenon previously discussed only through anecdote and speculation (the "Dead Internet Theory").
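The arithmetic behind such a baseline is simple to sketch. Assuming per-site detector labels grouped by publication year (the record format and the toy numbers below are hypothetical, not taken from the paper), the headline figure is just the fraction of sampled sites the detector flags in each year:

```python
from collections import defaultdict

def ai_share_by_year(records):
    """Fraction of sampled sites classified AI-generated/assisted, per year.

    records: iterable of (year, is_ai) pairs, where is_ai is the
    detector's binary classification for one sampled website.
    """
    totals = defaultdict(int)
    ai_counts = defaultdict(int)
    for year, is_ai in records:
        totals[year] += 1
        ai_counts[year] += int(is_ai)
    return {year: ai_counts[year] / totals[year] for year in totals}

# Toy sample: share rises from 0 (2022) to 0.35 (2025).
sample = [(2022, False)] * 10 + [(2025, True)] * 7 + [(2025, False)] * 13
print(ai_share_by_year(sample))  # {2022: 0.0, 2025: 0.35}
```

The real study's complexity lives upstream of this aggregation, in sampling representatively from the archive and in the detector's error rates, neither of which this sketch models.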
What the data shows:
- Semantic diversity correlates negatively with AI text prevalence — ideas converge as AI content grows
- Positive sentiment correlates positively with AI text prevalence — the internet gets more upbeat
- Factual accuracy shows no statistically significant change
- Stylistic diversity shows no statistically significant change
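The semantic/stylistic split in these findings depends on how diversity is operationalized. A minimal sketch of one common proxy (assumed for illustration, not the paper's stated method): semantic diversity as the mean pairwise cosine distance between document embeddings, with toy 2-d vectors standing in for real embeddings.

```python
import math
from itertools import combinations

def mean_pairwise_cosine_distance(vectors):
    """Semantic diversity proxy: average of 1 - cos(a, b) over all pairs."""
    def cos(a, b):
        dot = sum(x * y for x, y in zip(a, b))
        na = math.sqrt(sum(x * x for x in a))
        nb = math.sqrt(sum(x * x for x in b))
        return dot / (na * nb)
    pairs = list(combinations(vectors, 2))
    return sum(1 - cos(a, b) for a, b in pairs) / len(pairs)

# Toy embeddings: near-identical meanings score low, spread-out meanings high.
convergent = [[1.0, 0.0], [0.99, 0.1], [0.98, 0.05]]
divergent = [[1.0, 0.0], [0.0, 1.0], [0.7, 0.7]]
print(mean_pairwise_cosine_distance(convergent)
      < mean_pairwise_cosine_distance(divergent))  # True
```

Stylistic diversity can be measured the same way over surface-feature vectors (sentence length, punctuation rate, type-token ratio), which is what makes the pattern above expressible at all: dispersion in feature space can hold steady while dispersion in embedding space shrinks.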
The perception gap. A user study found that a majority of US adults endorse all four hypotheses (reduced semantic diversity, increased positive sentiment, decreased factual accuracy, decreased stylistic diversity). People who do not use AI, or use it only infrequently, are more likely to believe in the negative impacts; those who hold negative views of AI endorse the hypotheses more strongly. The perception of harm exceeds the measured harm on two of the four dimensions but is validated on the other two. Public fear is neither paranoia nor prophecy; it is half right.
The semantic diversity finding is the key result. Stylistic diversity is preserved — the words vary — but semantic diversity declines. This mirrors the pattern from Why do different LLMs generate nearly identical outputs?: surface variation masks idea convergence. The internet is saying the same things in different ways.
Connection to model collapse. As Does training on AI-generated content permanently degrade model quality? asks, the 35% AI content baseline establishes the starting condition for recursive degradation. If future models train on web crawls that are already one-third AI-generated, the loss of the distribution's tail accelerates. The semantic diversity decline measured here may be the early empirical signal of model collapse manifesting in the wild rather than in lab experiments.
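The acceleration claim can be made concrete with a toy recursion (purely illustrative; the fractions and the shrink factor are invented, and only the tail's probability mass is tracked):

```python
def tail_mass_after(generations, synthetic_frac, human_tail=0.05, shrink=0.8):
    """Toy model of recursive degradation.

    Each generation trains on a crawl that is (1 - synthetic_frac) human
    text and synthetic_frac output of the previous model, and the trained
    model retains only `shrink` of the rare-pattern ("tail") probability
    mass it saw -- a crude stand-in for generative models undersampling
    rare patterns. Head renormalization is ignored.
    """
    tail = human_tail
    for _ in range(generations):
        seen = (1 - synthetic_frac) * human_tail + synthetic_frac * tail
        tail = shrink * seen
    return tail

# More synthetic data in the crawl -> faster tail erosion.
print(round(tail_mass_after(5, 0.0), 4))   # 0.04
print(round(tail_mass_after(5, 0.35), 4))  # 0.0361
```

With a clean crawl the tail settles at a fixed floor, because each generation re-anchors to human data; with a one-third-synthetic crawl each generation compounds the previous generation's loss, so the same number of training rounds erodes strictly more of the tail.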
The positive sentiment bias confirms what the homogeneity research predicts: AI output defaults to agreeable, constructive, upbeat framing. As Does AI homogenize culture the way mass media did? argues, the sentiment shift represents the AI culture industry's affective signature: systematically positive, systematically inoffensive, systematically unremarkable.
The factual accuracy non-finding is surprising given hallucination concerns but may reflect selection effects: AI-generated websites that contain obvious factual errors may be less likely to persist in the archive, or factual accuracy may be domain-dependent in ways the aggregate measure misses.
Source: Social Theory Society Paper: The Impact of AI-Generated Text on the Internet
Related concepts in this collection
- Why do different LLMs generate nearly identical outputs?
  Explores whether diversity in model architectures and training actually produces diverse ideas, or whether shared alignment procedures and training data cause convergence on similar responses.
  Relevance here: semantic convergence despite stylistic variety; the mechanism behind declining semantic diversity
- Does training on AI-generated content permanently degrade model quality?
  When generative models train on outputs from previous models, do the resulting models lose rare patterns permanently? The question matters because future training data will inevitably contain synthetic content.
  Relevance here: 35% AI content is the baseline for recursive degradation
- Does AI homogenize culture the way mass media did?
  If AI generates contextually unique outputs, how can its underlying form be homogeneous? This explores whether AI repeats the culture industry's pattern of suppressing novelty under the guise of variety.
  Relevance here: positive sentiment bias as affective signature of the AI culture industry
- Can humans detect AI writing if it looks natural?
  Despite measurable differences in how AI generates text, human judges—even experts—consistently fail to identify it. This explores why perception lags behind measurement.
  Relevance here: the detection gap: text is statistically distinguishable but pragmatically indistinguishable
- Why do fake news detectors flag AI-generated truthful content?
  Explores why systems trained to detect deception misclassify LLM-generated text as fake. The bias may stem from AI linguistic patterns rather than content veracity, raising questions about what these detectors actually measure.
  Relevance here: AI detection as proxy for style detection, not truth detection
Original note title
35 percent of new websites are AI-generated by mid-2025 — semantic diversity declines and positive sentiment rises but factual accuracy and stylistic diversity are unaffected