The Vampire Squid Effect: AI Is Extracting the Web It Needs to Exist

This post reflects our direct experience and observations as a website operator. The figures cited are from our own Cloudflare dashboard. This is a starting point for discussion — not legal advice or a formal economic analysis.

Amazonbot requests (30 days): 35,380
Visitors returned by Amazon: 0
Googlebot requests (same period): 256
Visitors returned by Google: 56

The Numbers Don't Lie

Our Cloudflare dashboard told a story last month. Amazon's crawlers made 35,380 requests to this site. Google made 256. Amazon sent back zero visitors. Google sent 56.

That ratio — 138 times more crawling, zero times the return — is not a bug in Amazon's system. It is the system. It is what AI-era web extraction looks like at scale, and it is happening to every website on the open internet simultaneously.
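The ratios above come straight from the dashboard figures; a quick back-of-envelope check confirms them:

```python
# Back-of-envelope check on the crawl and referral ratios,
# using the 30-day Cloudflare dashboard figures cited above.
amazonbot_requests = 35_380
googlebot_requests = 256
amazon_referrals = 0
google_referrals = 56

crawl_ratio = amazonbot_requests / googlebot_requests
google_referral_rate = google_referrals / googlebot_requests
amazon_referral_rate = amazon_referrals / amazonbot_requests

print(f"Amazon crawled {crawl_ratio:.0f}x more than Google")  # prints 138x
print(f"Google referral rate: {google_referral_rate:.1%}")    # prints 21.9%
print(f"Amazon referral rate: {amazon_referral_rate:.1%}")    # prints 0.0%
```

The 21.9% figure is what the text rounds to a 22% referral rate.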

In 2010, Rolling Stone's Matt Taibbi described Goldman Sachs as "a great vampire squid wrapped around the face of humanity, relentlessly jamming its blood funnel into anything that smells like money." He was describing an institution that had positioned itself at every point where value flowed, extracting without producing an equivalent return. The metaphor was striking because it was structurally accurate.

It describes the AI industry's relationship with the web just as accurately today.

The Bargain That Built the Web

The internet that AI companies are currently strip-mining was built on a deal. Nobody signed it. There was no contract. But it was real, and everyone who published anything online understood the terms.

Google would crawl your site. It would index your content. It would make your pages findable when people searched for what you'd written. In exchange, it would send you traffic. That traffic was the economic engine that made the open web viable — it drove ad revenue, generated leads, built audiences, funded journalism, and gave millions of creators a reason to keep creating things worth reading.

The deal was imperfect. SEO became an industry devoted to gaming it. Content farms exploited it. Google repeatedly adjusted its algorithms to punish manipulation because low-quality content degraded the product it needed to stay useful. But the incentive structure was roughly aligned: Google needed quality content to remain worth using, and content creators needed Google to remain discoverable. Each depended on the other.

The return rate was never generous — maybe 10 to 20 visitors per 100 crawled pages, depending on the site and query competition. But it was a return. Google crawled us 256 times and sent 56 people. That is roughly a 22% referral rate. Not spectacular. But real. A transaction with two parties, both getting something.

The New Model: Take Everything, Return Nothing

AI training operates on a different model entirely. The crawl is vastly larger. The data is ingested, processed, and used to train commercial models that are then sold as products and API services. But the step that used to close the economic loop — sending users back to the source — has been eliminated by design.

When someone asks an AI assistant a question, the answer comes from the model. The user does not visit the site that contained the information the model learned from. There is no referral. There is no traffic. There is no click. There is no revenue for the creator who produced the content that made the answer possible.

Amazon crawled this site 35,380 times. That crawl has economic value to Amazon — those pages are part of the training corpus that makes Alexa, AWS Bedrock, and Amazon's AI services more capable. Amazon will sell access to those capabilities. We received nothing.

The Structure of the Problem

This is not a complaint about one company. The same pattern applies across every major AI provider. The specific numbers vary. The directionality does not. AI companies extract from the web at enormous scale and return nothing of economic value to the people who produced what they extracted. The ratio is not just lower than search — it is zero.

This is not an accident or an oversight waiting to be corrected. It is the architecture of the business model. AI companies benefit from their models being consulted instead of the source. Every time a user gets an answer from an AI rather than clicking through to a website, that is the product working as intended. The disintermediation of the original source is the feature, not the flaw.

What AI Companies Claim to Return

The AI industry has several answers to the extraction critique. None of them hold up.

Claim 1: Citations and attribution

Some AI systems cite sources when answering queries. This is true. It is also irrelevant. A citation in an AI response is not a referral. The user has already received a complete answer. There is no reason to click the link, and most don't. A citation that sends no traffic is brand exposure at best — and sporadic, unreliable brand exposure at that.

Claim 2: Discovery and brand awareness

If an AI names your company while answering a question, perhaps a user will search for you directly. Perhaps. This chain of events is too attenuated to constitute a sustainable economic return on the scale of content extraction that is occurring. "Maybe someone will Google you after we strip-mine your work" is not a value proposition.

Claim 3: Rising tide economics

AI will generate enormous economic value, create new industries, and benefit everyone through productivity gains — including content creators. This argument is structurally identical to every previous "disruptive technology will benefit its victims eventually" claim. The people paying the cost are not the people receiving the benefit, and the benefit is always promised on a longer timeline than the one on which the cost is being paid.

The System Is Eating Itself

Beyond the fairness argument, there is a structural problem that the AI industry has not adequately addressed: AI models that extract value from the web need the web to keep producing valuable content. But the web produces valuable content because content creation is economically viable. And content creation is economically viable because of the traffic that search referrals generate.

If AI systematically displaces search — which is explicitly its purpose — it reduces the traffic that makes content creation viable. Reduced traffic means reduced revenue. Reduced revenue means reduced incentive to create. Less creation means less new content. Less new content means AI models trained on progressively staler corpora, or worse, on content generated by other AI systems.

Researchers have a name for what happens when models train on AI-generated content: model collapse. The outputs degrade. The model loses the granularity and specificity of real human knowledge and begins to produce the confident, smooth, slightly wrong kind of text that AI generates when it is predicting language rather than conveying information.

The vampire squid does not just drain the host. It needs the host to survive in order to keep feeding.

The Logical Endpoint

If AI extraction destroys the economics of content creation, the open web contracts. Quality content moves behind paywalls or disappears. The training corpus available to future models shrinks and degrades. The models get worse. Users trust them less. The entire edifice depends on a continuous supply of human-generated content — content that AI is actively making less economically viable to produce.

Specialized Content Pays the Highest Price

Not all content is equally affected. There is an asymmetry that makes the problem worse than the aggregate numbers suggest.

The content that is most valuable for AI training — specialized, domain-specific, expert knowledge — is the hardest to produce and the least likely to have a large general audience. Technical documentation. Procurement guidance. Medical reference material. Legal analysis. Industrial specifications. Forum discussions among practitioners who have spent decades in a specific field.

This content does not have millions of readers. It has hundreds — or dozens — of highly qualified people who need it. It is not generously funded by advertising. It exists because someone decided the effort was worth it. A procurement specialist who spent thirty years navigating defence supply chains and published their hard-won knowledge on the open web did not do so to train Amazon's commercial AI products for free.

AI training data disproportionately values exactly this kind of content. The signal-to-noise ratio is far better than generic web pages. A discussion thread among marine engineers about anchor chain fatigue specifications is worth far more to an industrial AI training pipeline than ten thousand generic articles about anchor chains.

The people who created those discussions received nothing for their contribution to the training corpus. Most of them don't know it was taken.

The Web Is Starting to Fight Back

The equilibrium of mass extraction with zero return is unstable. Several things are already happening to break it.

Legal challenges are underway. The New York Times sued OpenAI and Microsoft for copyright infringement. Authors' guilds have filed class actions. Getty Images has sued Stability AI over image training data. These cases will take years to resolve, but they have already forced the AI industry to acknowledge that the question of compensation is not settled by unilateral fiat.

Technical defences are being deployed. Cloudflare now offers AI crawler blocking as a first-class feature, and adoption is growing. Paywalls and authentication requirements are spreading. The open web is beginning to close, not because publishers want to restrict access, but because the economics of remaining open to AI scrapers have become purely negative.
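Edge-level blocking of the kind Cloudflare offers is one layer. A complementary signal many publishers deploy alongside it is robots.txt. A minimal sketch, listing a few of the publicly documented AI crawler user-agent tokens (the exact set a site blocks is a policy choice; Amazonbot is the crawler named in the figures above):

```text
# robots.txt — disallow known AI training crawlers, keep search open

User-agent: GPTBot
Disallow: /

User-agent: ClaudeBot
Disallow: /

User-agent: CCBot
Disallow: /

User-agent: Amazonbot
Disallow: /

User-agent: Google-Extended
Disallow: /

# Search crawlers that still send referral traffic remain allowed
User-agent: Googlebot
Allow: /
```

robots.txt is advisory — a crawler can simply ignore it — which is exactly why enforced blocking at the network edge has become the meaningful defence.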

Commercial licensing markets are forming. Reddit negotiated a data licensing deal with Google before its IPO. News organizations are negotiating licensing arrangements rather than simply blocking. A market for training data is emerging — not because AI companies wanted to pay, but because content owners are forcing the issue through a combination of technical barriers and litigation.

None of this will fully resolve the structural problem. But it suggests the current arrangement cannot persist indefinitely.

What We Are Doing About It

We have blocked the major AI training crawlers on this site via Cloudflare. Not because we are philosophically opposed to AI — we are actively building with it. But because there is a meaningful difference between AI tools that we choose to use and that return value to us, and AI training pipelines that take our content without asking and return nothing.

The search bargain was imperfect, but it was a bargain. If you crawl us, you send us readers. AI training as currently practiced is not a bargain. It is extraction. We are under no obligation to participate.

The open web was built by people who published things because the publishing was useful or valuable to them. It was not built as a free training corpus for commercial AI products. The companies consuming it at industrial scale seem to have forgotten that. Some of the people who built it are starting to remind them.