Does statistical rarity actually correlate with originality that law should protect?

This explores whether 'statistically rare' and 'legally original' are the same thing — whether measuring how unusual a text is in some feature space can stand in for the human authorship copyright law actually protects.

Statistical rarity is a useful proxy for the kind of originality law is meant to protect, but the corpus warns against mistaking the proxy for the thing. The strongest case for yes comes from StoryScope, which operationalizes originality as rarity in the space of discourse-level narrative choices and finds that human stories occupy measurably rarer regions while AI outputs cluster tightly together [Can statistical rarity measure whether stories are truly original?]. That is a real signal: it gives copyright's fuzzy 'human conception' requirement something measurable to point at, and it lines up with the observation that independent models converge on similar outputs despite nominal competition, homogenizing culture in ways invisible to any single user [Does AI homogenize culture the way mass media did?].
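One way to picture 'rarity in a feature space' is a nearest-neighbour distance score. The sketch below is an illustration only, not StoryScope's actual method: the function name `knn_rarity`, the two-dimensional feature vectors, and the choice of k are all assumptions made for the example.

```python
import math

def knn_rarity(point, reference, k=3):
    """Mean Euclidean distance from `point` to its k nearest
    neighbours in `reference`. Larger means rarer."""
    distances = sorted(math.dist(point, other) for other in reference)
    nearest = distances[:k]
    return sum(nearest) / len(nearest)

# Hypothetical 2-D feature vectors standing in for narrative choices.
# Four texts sit in a tight cluster; one sits far away.
reference = [(0.0, 0.0), (0.1, 0.0), (0.0, 0.1), (0.1, 0.1), (5.0, 5.0)]

clustered_text = (0.05, 0.05)  # inside the tight cluster
outlying_text = (3.0, 3.0)     # far from most of the corpus

clustered_score = knn_rarity(clustered_text, reference)
outlying_score = knn_rarity(outlying_text, reference)

# The outlying text scores as rarer than the clustered one.
print(clustered_score < outlying_score)
```

The score depends only on where a text lands relative to the reference corpus. It carries no information about who wrote the text or whether the unusual choices were any good.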

But the corpus also shows rarity measuring things that have nothing to do with protectable originality. In curriculum training, rare data is treated as a sign of distributional weakness (a gap from the pre-training distribution to be patched), not as conceptual value [Does ordering training data by rarity actually improve language models?]. In retrieval systems, rarity is just a failure-mode detector, flagging where a model is likely to hallucinate about uncommon entities [Should RAG systems use model confidence or data rarity to trigger retrieval?]. Same statistic, opposite meaning: there, being rare makes something a liability, not a contribution. So rarity alone can't tell you whether you're looking at a creative leap or a data hole.
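The retrieval case can be sketched as a hybrid trigger in which rarity is one of two warning signals. This is a minimal illustration under stated assumptions: the function name `should_retrieve`, the corpus-frequency input, and both thresholds are hypothetical and are not taken from the cited note.

```python
def should_retrieve(confidence, entity_count,
                    min_confidence=0.7, min_count=100):
    """Trigger retrieval if the model is unsure OR the entity is rare.

    confidence:   model's self-reported confidence in [0, 1] (assumed input)
    entity_count: how often the entity appears in a reference corpus (assumed input)
    """
    low_confidence = confidence < min_confidence
    rare_entity = entity_count < min_count
    return low_confidence or rare_entity

# Confident answer about a rare entity: rarity alone triggers retrieval.
print(should_retrieve(confidence=0.9, entity_count=5))
# Unsure answer about a common entity: confidence alone triggers retrieval.
print(should_retrieve(confidence=0.4, entity_count=10_000))
```

Here a low frequency count is read purely as risk, which is the opposite of how an originality metric would read the same number.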

The sharpest crack appears when you separate 'novel' from 'valuable.' LLMs can generate research ideas rated statistically *more* novel than expert ideas while scoring slightly lower on feasibility, because expert knowledge constrains novelty toward what actually works [Do language models generate more novel research ideas than experts?]. Rarity rewards the unconstrained wandering; the thing we usually mean by 'original and worth protecting' includes the discipline that makes rarity meaningful rather than merely odd. This is why structured novelty assessment (extract the claims, retrieve the prior art, compare) aligns better with human reviewers than holistic 'how unusual does this feel' baselines [Can structured pipelines make LLM novelty assessment reliable?]. Originality judgments humans trust are relational and contextual, not a single distance-from-the-mean number.
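The extract, retrieve, compare structure can be sketched in toy form. The pipeline described in the corpus uses LLMs at each stage; the version below swaps in sentence splitting and token overlap (Jaccard similarity) purely to show the relational shape of the judgment. All function names, the threshold, and the example texts are hypothetical.

```python
import re

def tokens(text):
    """Lowercased word set for a piece of text."""
    return set(re.findall(r"[a-z]+", text.lower()))

def jaccard(a, b):
    """Overlap between two token sets, in [0, 1]."""
    union = a | b
    return len(a & b) / len(union) if union else 0.0

def extract_claims(submission):
    # Stage 1 (toy): treat each sentence as one claim.
    return [s.strip() for s in re.split(r"[.!?]", submission) if s.strip()]

def retrieve_closest(claim, prior_art):
    # Stage 2 (toy): pick the prior work with the highest token overlap.
    return max(prior_art, key=lambda work: jaccard(tokens(claim), tokens(work)))

def assess(submission, prior_art, threshold=0.5):
    # Stage 3: compare each claim with its closest prior work.
    report = []
    for claim in extract_claims(submission):
        closest = retrieve_closest(claim, prior_art)
        overlap = jaccard(tokens(claim), tokens(closest))
        report.append((claim, closest, overlap < threshold))  # True = looks novel
    return report

# Hypothetical prior art and submission, invented for the example.
prior_art = [
    "curriculum learning orders training data by difficulty",
    "retrieval augmented generation reduces hallucination",
]
submission = (
    "Curriculum learning orders training data by difficulty. "
    "Rarity in narrative space separates human stories from machine stories."
)

report = assess(submission, prior_art)
for claim, closest, looks_novel in report:
    print(looks_novel, "|", claim)
```

Each verdict is tied to a specific claim and a specific piece of prior work, so a reviewer can inspect and dispute it. A single distance-from-the-mean number offers nothing comparable to check.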

There's also a deeper objection the corpus raises: law may protect something rarity *cannot see at all*. The argument that AI output carries only 'statistical residue' rather than the spirit of a giver locates authorship in provenance (the fact that a person made it), not in any property of the text itself [Why doesn't AI output carry the spirit of a giver?]. On that view a statistically rare AI passage and a statistically common human one could land on opposite sides of the legal line from where a rarity metric would put them, because what's protected is the relationship, not the feature vector. The related claim that AI output is structurally hearsay, unattributable at the origin, pushes the same way: the thing legal tools are built to track is the chain back to a source, which rarity discards [Does AI-generated knowledge have the same structure as hearsay?].

So: rarity correlates with originality well enough to be a genuinely useful detector — especially for telling tightly-clustered machine output from the wider spread of human work — but it conflates creativity with distributional weirdness, rewards novelty unconstrained by value, and is blind to the provenance that may be what law actually protects. The interesting takeaway is that the best published proxy and the strongest critique of proxies live in the same collection, and they don't contradict so much as mark the boundary of what any single statistic can carry.

Sources 8 notes

Can statistical rarity measure whether stories are truly original?

StoryScope operationalizes originality as statistical rarity in discourse-level narrative decisions. Human stories are measurably rarer in this space than AI outputs, which cluster tightly, offering a quantifiable proxy for the human conception copyright law requires.

Does AI homogenize culture the way mass media did?

AI mass-generates similar flows disguised as personalized outputs, suppressing novelty more deeply than pre-stamped commodities because contextual customization makes homogeneity invisible to individual users. Evidence: independent LLMs converge on similar outputs despite nominal competition.

Does ordering training data by rarity actually improve language models?

CTFT fine-tunes LLMs on rare data first because rarity signals distributional weakness, not conceptual difficulty. This reframes curriculum learning as managing distance from pre-training distribution rather than pedagogical scaffolding.

Should RAG systems use model confidence or data rarity to trigger retrieval?

Model confidence and data-rarity signals catch orthogonal failure modes: confidence misses hallucinations about rare entities, while rarity misses uncertain reasoning about common knowledge. Hybrid triggers substantially outperform either signal alone.

Do language models generate more novel research ideas than experts?

A study of 100+ NLP researchers found LLM-generated ideas rated as significantly more novel than human expert ideas (p<0.05), though slightly lower on feasibility. Expert knowledge constrains novelty, while LLMs explore wider conceptual combinations.

Can structured pipelines make LLM novelty assessment reliable?

A three-stage pipeline (extract claims, retrieve related work, compare) reached 86.5% reasoning alignment and 75.3% conclusion agreement with human reviewers on 182 ICLR submissions, outperforming holistic LLM baselines.

Why doesn't AI output carry the spirit of a giver?

AI-generated content lacks hau—the spiritual essence that binds gift economies—because no person gave it. This absence is more fundamental than alienation: the output was never anyone's to begin with, so no relationship of obligation forms.

Does AI-generated knowledge have the same structure as hearsay?

AI output shares all defining features of hearsay: testimony at remove, modification in retelling, unattributable origin, and unverifiability against stable sources. This means Enlightenment verification tools—citation, archiving, peer review, evidentiary chains—cannot process AI output by design.
