The Habsburg effect: why your data just got more valuable
The European Habsburgs built an empire through strategic marriages, consolidating power by keeping bloodlines strictly within the family. It worked for a while. By the time Charles II of Spain took the throne in 1665, generations of intermarriage had concentrated physical traits so severely that the king could barely chew his own food.
Something similar is happening with AI right now.
The recursive trap
Researchers call it model collapse, but the AI community has better names: AI inbreeding, AI cannibalism, Habsburg AI.
Large language models train on text scraped from the internet, billions of documents showing how humans write. Then they generate new text, and people publish it as blog posts, articles, product descriptions, marketing copy. The next generation of models scrapes the web again, and now their training data is a mix of human writing and machine output.
If you had asked a model two years ago, before AI content was widespread online, to write a coffee maker description, you'd have gotten: “Brews up to 12 cups, features a programmable timer, includes a reusable filter.” But ask a model trained on AI content to do the same thing, and you start seeing phrases like “elevate your morning ritual” and “seamlessly integrates into your lifestyle.”
These lines appear in thousands of AI descriptions, and the model treats them as the normal way to describe products. Each generation ingests more recycled content, and the pile keeps growing.
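To make the loop concrete, here's a minimal sketch of how the synthetic share of a scraped corpus compounds over time. The publication rate and the share of new text that is machine-written are invented numbers for illustration, not measurements.

```python
# Toy model of how machine-generated text piles up in scraped training data.
# All numbers are assumptions chosen for illustration, not real measurements.

human_pages = 100.0        # existing human-written pages (arbitrary units)
synthetic_pages = 0.0      # machine-generated pages published so far
new_pages_per_year = 20.0  # assumed volume of new publishing each year
ai_share_of_new = 0.5      # assumed share of new publishing that is machine-written

for year in range(1, 6):
    synthetic_pages += new_pages_per_year * ai_share_of_new
    human_pages += new_pages_per_year * (1 - ai_share_of_new)
    total = human_pages + synthetic_pages
    print(f"Year {year}: {synthetic_pages / total:.0%} of the scrapeable corpus is synthetic")
```

Even with these modest made-up numbers, the synthetic fraction climbs every year, because the old human archive grows slowly while machine output keeps stacking on top of it.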
How quality degrades
Oxford and Cambridge researchers published a study breaking collapse into two stages, and the first one is insidious because nothing looks wrong.
The model loses rare details first, things like unusual phrases, unexpected viewpoints, and minority opinions, but overall performance seems fine. Benchmarks pass, customers stay happy, and engineers see green lights across the board.
Later, models start confusing concepts and the writing gets blander, until everything sounds like everyone and no one at the same time. The system learned too much about common patterns and forgot the uncommon ones, and those uncommon bits are exactly what makes language interesting. They shrink with each generation until almost nothing remains.
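That tail loss is easy to reproduce in a toy simulation. The sketch below assumes a made-up vocabulary in which a handful of phrases are common and the rest are rare; each "generation" learns phrase frequencies from a finite sample of the previous generation's output, then generates the next corpus from those frequencies. It's an illustration of the mechanism, not the researchers' actual experiment.

```python
import numpy as np

rng = np.random.default_rng(0)

# Made-up vocabulary: 10 very common phrases plus a long tail of rare ones.
vocab_size = 1_000
probs = np.ones(vocab_size)
probs[:10] = 200.0
probs /= probs.sum()

sample_size = 5_000  # finite training set drawn each generation

for generation in range(1, 6):
    # The "model" learns phrase frequencies from a finite sample of the
    # previous corpus, then generates the next corpus from that estimate.
    sample = rng.choice(vocab_size, size=sample_size, p=probs)
    counts = np.bincount(sample, minlength=vocab_size)
    probs = counts / counts.sum()
    surviving = np.count_nonzero(probs)
    print(f"Gen {generation}: {surviving} of {vocab_size} phrases still appear")
```

The ten common phrases survive every round. It's the long tail, the part that made the original corpus varied, that quietly drops to zero probability and never comes back.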
The contamination is already everywhere
The internet is already contaminated, and there's no practical way to clean it up.
Since ChatGPT launched in late 2022, machine-written content has flooded the web. Marketing copy, news articles, social posts, product reviews, academic papers, code repositories. All of it now fills the data sources AI companies need for their next models, and filtering it out is close to impossible. Machine-written text regularly fools detection tools, especially when humans make small edits before hitting publish.
Amazon noticed it early: product reviews started sounding alike, with identical sentence structures and suspicious enthusiasm for trivial features replacing the specific details that once made reviews useful.
Training data for new models already contains outputs from older models, and nobody knows how to break the loop.
IBM estimates that by the time an organization notices the collapse, the damage is already too deep to repair. Publicly available human data might even run out between 2026 and 2032, according to Epoch AI.
Pre-2022 data is the new oil
Anything written before November 2022 became a different asset category almost immediately. A finite resource of real human thought, captured before machine outputs polluted everything. Nobody can manufacture more of it.
OpenAI, Anthropic, Google, and the rest are stockpiling old data wherever they can find it, signing exclusive deals with publishers and licensing agreements with archives. Reddit's IPO valuation rested partly on its pre-2022 conversations. News companies that gave away content for years suddenly have leverage they never expected.
Think about who else is sitting on goldmines. Book publishers with decades of backlists, academic journals with centuries of research, newspapers with archives predating the internet. Tech companies spent years taking this data for free, and now scarcity is forcing them to pay, and pay a lot, for what they helped exhaust.
The Harvard Journal of Law & Technology published a paper arguing for a "right to uncontaminated human-generated data," suggesting that access to real human expression may need legal protection as machine-written content drowns out the genuine article.
What this means for you, specifically
Step back from the industry view for a second.
AI systems need real human data to stay grounded. Machine-written content can extend what humans create, but it can't replace the source, the same way a photocopy can't replace a painting. Real decisions made when the answer was unclear, real conversations working through disagreement, real mistakes and the fixes that followed. Web scraping misses all of it.
Personal data has always been valuable to advertisers and platforms, but model collapse adds a new dimension. How you think through problems, how you make choices, how you explain yourself differently to different people. AI needs this and can’t generate it on its own. And whoever scrapes your data first gets to use it. Every email through Gmail, every document in Google Docs, every conversation on a Meta platform feeds training sets you have no control over and receive nothing from.
Think about what your records actually contain. Years of emails showing how you work through problems, documents with full edit histories showing every change and every restart, chat logs with the loose thinking that never makes it into polished writing. That's a detailed map of how you think, precisely what AI needs, and it can't be faked.
The same mistake, different century
The Habsburgs could have married outside the family while still building alliances, but they wanted total control and paid for it with extinction.
AI companies have the same choice, and most are making the same mistake. Scrape everything, feed machine-written content back into training when human sources dry up, hope nobody notices quality dropping. That direction leads to sameness and decay, where every output sounds like a worse copy of the one before it.
The other direction means treating human data as scarce and worth protecting. Platforms have taken what you create for years while giving you nothing but access to tools that keep getting worse. Their business model depends on access to your data, not on restricting it. If you want control over how your thinking gets used, you have to keep your own records, in your own systems, under your own terms.
Where we go from here
AI companies spent 2023 and 2024 building bigger models with more parameters, longer context windows, and faster inference. The assumption was that scale solves everything, and model collapse proved it wrong.
Quality now matters more than quantity, and origin matters more than volume. The real human record matters more than terabytes of machine text converging on the same flat tone.
The Habsburgs ran out of options after generations of genetic damage. AI still has time to change course, but only if the people building it acknowledge that the recursive trap has already sprung and that contamination spreads faster than anyone wants to admit.
Your data is what keeps AI connected to human reality. Platforms will keep taking it until you decide to hold onto it yourself.