The honest case for a DNS-based AI crawl protocol — and a clear-eyed admission of what it cannot stop
This post reflects our experience as a website operator and our attempt to propose a better standard. The PAQ protocol referenced here is an exploratory draft published at github.com/Dibblee/paq-spec. It is not a ratified standard.
The Gate With No Fence
Robots.txt is a text file. It contains a short list of plain-text directives. It has no enforcement mechanism of any kind. It works because the companies that read it have decided, for their own reasons, to comply. A crawler that ignores it faces no technical barrier, no legal penalty, and in most jurisdictions no consequence whatsoever.
The fence is not there. It has never been there. What exists is a gate standing in an open field, and the assumption that anyone approaching it will choose to use it.
This is not a flaw that can be patched. It is the design. Robots.txt was written in 1994 to coordinate between webmasters and the handful of search engine crawlers that existed at the time. All parties were operating in good faith and had aligned incentives. That world no longer exists.
Why Compliance Is Selective
The major US AI companies — Anthropic, OpenAI, Amazon, Google — generally honour robots.txt. Not because they are legally required to. Because they have reputational exposure in the US market, legal teams who have evaluated the litigation risk, and PR departments that would prefer not to generate headlines about ignoring stated operator preferences.
ByteDance made 12,200 requests to this site in thirty days. ByteDance does not have the same reputational calculus in the North American market. Chinese crawlers, offshore scraper farms, and anyone operating outside the cultural and legal gravity of Silicon Valley have no particular reason to check a text file before taking what they want.
Robots.txt compliance is a social contract enforced by reputational pressure on companies that care about their US standing. It is not a technical control, not a legal instrument, and not effective against anyone who has decided it does not apply to them. It stops the people who were probably going to be reasonable anyway.
This matters because the framing around robots.txt — and by extension most AI crawl policy discussion — treats compliance as a solved problem that just needs to be extended to new user-agent strings. It is not a solved problem. The enforcement mechanism is absent. Adding more lines to a text file does not change that.
PAQ: A Different Contract
We published a draft protocol called PAQ — Public AI Query Protocol. The core idea is to move the policy signal from an HTTP-layer text file to a DNS TXT record, and to pair that signal with a structured query endpoint that gives compliant agents a reason to use it rather than crawl.
_paq.example.com IN TXT "v=PAQ1; c=throttled; e=https://example.com/.well-known/paq"
DNS resolution happens before any HTTP connection is opened. An agent can resolve _paq records for an entire crawl queue before making a single HTTP request. The signal lives at a layer the site's content cannot spoof — the same model SPF and DMARC use for email authentication. Policy changes propagate on the record's TTL. No HTTP request is required to discover the policy.
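An agent that has fetched the TXT record still has to parse it. A minimal sketch, assuming the semicolon-separated tag=value form shown in the example record above (the tag names v, c, and e are taken from that example; the draft spec is the authority on the actual grammar):

```python
# Parse a PAQ TXT record of the form shown above into its fields.
# Illustrative only: tag semantics follow the example record, not a
# ratified grammar.

def parse_paq_record(txt: str) -> dict:
    """Split 'v=PAQ1; c=throttled; e=https://...' into a dict."""
    fields = {}
    for part in txt.split(";"):
        part = part.strip()
        if "=" in part:
            # partition on the first '=' so URLs with '=' survive intact
            key, _, value = part.partition("=")
            fields[key.strip()] = value.strip()
    return fields

record = 'v=PAQ1; c=throttled; e=https://example.com/.well-known/paq'
policy = parse_paq_record(record)
print(policy["c"])  # throttled
```

The agent can then branch on the crawl class (`c=`) and, if it intends to query rather than crawl, follow the endpoint URL in `e=`.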
The endpoint answers three commands: DESCRIBE (what is here), GET (retrieve a specific resource by ID), and SEARCH (find resources by keyword, returning IDs only). There is no pagination. There is no batch export. An agent that wants everything cannot get it through PAQ. The manifest is curated by the operator. That is by design.
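The shape of that command set can be sketched as operator-side dispatch. The manifest structure and response shapes below are assumptions for illustration; the draft spec defines the actual wire format. What the sketch does show accurately is the deliberate constraint: DESCRIBE and SEARCH return identifiers, and GET returns exactly one resource per request.

```python
# Sketch of the three-command endpoint described above. Hypothetical
# manifest and response shapes; the constraint (no bulk path) is the point.

MANIFEST = {  # operator-curated: only listed resources are reachable
    "post-001": {"title": "The gate with no fence", "body": "Robots.txt is a text file..."},
    "post-002": {"title": "PAQ draft notes", "body": "DNS TXT records carry the policy..."},
}

def handle(command: str, arg: str = "") -> object:
    if command == "DESCRIBE":
        # What is here: IDs and titles only, never full content in bulk.
        return {rid: r["title"] for rid, r in MANIFEST.items()}
    if command == "GET":
        # One resource per request, by ID. There is no batch export.
        return MANIFEST.get(arg)
    if command == "SEARCH":
        # Keyword match returning IDs only; the agent must GET each one.
        needle = arg.lower()
        return [rid for rid, r in MANIFEST.items()
                if needle in r["title"].lower() or needle in r["body"].lower()]
    raise ValueError(f"unknown command: {command}")

print(handle("SEARCH", "dns"))            # ['post-002']
print(handle("GET", "post-002")["title"])  # PAQ draft notes
```

Because SEARCH never returns content and GET never accepts more than one ID, an agent that wants the whole site has to issue one request per curated resource, which is exactly the visibility and throttling surface the operator wants.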
The spec is published on GitHub under MIT license. It is an exploratory draft, not a ratified standard.
What PAQ Can and Cannot Do
What it can do:
- Give compliant agents structured, curated data — better than scraped HTML
- Make policy checkable before any HTTP request via DNS
- Create a standard that could eventually carry legal weight in GDPR jurisdictions where bulk data extraction without consent is a live question
- Make non-compliance visible and measurable — you can log who checked the record and who did not
- Give AI labs an incentive to implement it: better signal-to-noise from curated endpoints versus guessing at HTML structure

What it cannot do:
- Stop ByteDance, offshore scraper farms, or anyone who has decided the rules do not apply to them
- Enforce anything without adoption by major AI labs
- Replace infrastructure-level blocking for bad-faith actors
- Guarantee the endpoint won't be bulk-queried by a conforming agent that decides enumeration is acceptable
- Exist as a standard without the same adoption problem that every web standard faces
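The measurability point above deserves a concrete shape. A sketch, under two stated assumptions: the operator runs (or can get query logs from) the authoritative DNS for the zone, so _paq lookups are observable, and both logs can be attributed to a crawler identity (ASN, user-agent, IP range — attribution is its own hard problem). Neither is mandated by the draft spec.

```python
# Sketch: non-compliance as a simple set difference between two logs.
# Crawler names here are illustrative identities, not log excerpts.

dns_checked = {"anthropic-bot", "gptbot"}                  # resolved _paq TXT
http_crawled = {"anthropic-bot", "gptbot", "bytespider"}   # fetched pages

# Crawlers that fetched content without ever checking the policy record.
unchecked_crawlers = http_crawled - dns_checked
print(sorted(unchecked_crawlers))  # ['bytespider']
```

That list is the evidence trail: with robots.txt you can only observe disallowed fetches after the fact, whereas a DNS-layer signal lets you show a crawler never even looked.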
The Two-Layer Answer
The honest conclusion is that no single mechanism solves this. The problem has two distinct populations: actors who will comply with a stated policy given a reasonable incentive, and actors who will not comply regardless.
For the second group, the answer is infrastructure. Cloudflare WAF rules, bot fingerprinting, rate limiting, and authentication walls for content you genuinely do not want extracted. These are blunt instruments but they are technical controls, not polite requests.
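The rate-limiting half of that list is worth one concrete example. A token bucket is the standard shape of a per-client limiter like the ones WAF products apply; the sketch below is generic, and the thresholds are illustrative rather than recommendations.

```python
# Token-bucket rate limiter: the blunt instrument, in miniature.
# Each client gets a bucket; requests spend tokens, time refills them.
import time

class TokenBucket:
    def __init__(self, rate: float, capacity: float):
        self.rate = rate          # tokens refilled per second
        self.capacity = capacity  # maximum burst size
        self.tokens = capacity
        self.last = time.monotonic()

    def allow(self) -> bool:
        now = time.monotonic()
        # Refill proportionally to elapsed time, capped at capacity.
        self.tokens = min(self.capacity, self.tokens + (now - self.last) * self.rate)
        self.last = now
        if self.tokens >= 1:
            self.tokens -= 1
            return True
        return False  # over budget: reject (or tarpit) the request

bucket = TokenBucket(rate=2, capacity=5)  # ~2 req/s sustained, bursts of 5
results = [bucket.allow() for _ in range(8)]
print(results.count(True))  # the initial burst of 5; refill over microseconds is negligible
```

Note what this does not do: it cannot distinguish a compliant agent from a hostile one, only a fast client from a slow one. That is precisely why it belongs in the infrastructure layer and the policy signal belongs elsewhere.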
For the first group — the major US AI labs, the companies with reputational and legal exposure — a protocol like PAQ gives them something better than robots.txt. A structured endpoint they can query cleanly. A DNS record they can check before crawling. A mechanism that lets them demonstrate compliance in a verifiable way. And for operators, a curated manifest means the AI sees what you want it to see, not everything it can reach.
Block the bad actors at the infrastructure layer. Set terms for the willing ones at the protocol layer. Neither works without the other. A protocol without enforcement is a gate with no fence. Infrastructure without a protocol is a wall with no door. The goal is both.
We have no illusion that PAQ will solve the structural problem described in the vampire squid post. The economics of AI extraction versus content creation are broken at a scale that a DNS record cannot address. But a clear, simple, implementable standard is a starting point. It is something labs can adopt, operators can deploy in an afternoon, and regulators can eventually point to when the question of AI data obligations becomes a legal matter rather than a voluntary one.
The gate needs a fence. This is a draft specification for the fence. Whether anyone builds it is a different question.