I’m reading the new Automated Content Access Protocol. It’s a proposed protocol by which content providers will tell Google how to do its job. I’m optimistic about the idea, but this spec is … strange.
The problem ACAP is trying to solve is that some people put stuff online that they don’t want search engines to pay attention to. There are a lot of good reasons to do this:
Telling search engines not to follow links in your blog comments is one way of keeping link-spamming scoundrels at bay.
Some search engine robots are bad houseguests; they put their muddy boots up on the couches and issue way too many requests at a time.
Like bread, some content gets stale, and you want visitors to the site to look at a fresher version.
The content is semi-private; not worth password protecting, but not really meant for everyone to trawl through, either.
Robots are often stupid, and could use unambiguous signs telling them to come in through the door, not the skylight.
There are also some more nakedly economic reasons, not all of which are bad:
You want people to know about your stuff so that they come to your site and look at your ads. People who look at cached copies on the search engine side don’t see the ads.
You provide recent content (like news) for free. After a while, the content goes behind a paywall. Once it does, search engines keep clear!
Search engines make truckloads of money, and as a hardworking content provider, you’d like a cut of the action.
Heretofore, the main protocols for telling search engines what to do and what not to do — robots.txt and the Robots META HTML tag — haven’t been particularly fine-grained. The search engine companies have been allowing providers to create quasi-proprietary “sitemaps,” but those tend to be optimized toward telling the search engine in great detail what it should look for, rather than telling it where not to look. The result is that content providers have generally faced a choice between opting out of the search economy entirely (a sure ticket to oblivion in this day and age) and opting in on terms they’re dissatisfied with. Some, like Copiepresse, have thrown conniption fits about it, but others have actually done something about the weather and come up with ACAP.
Robots.txt was a fascinating example of decentralized decision-making; no one requires sites to use it or search engines to respect it, but everyone does, pretty much as a matter of course. (Mine, for example, is as simple and generous as possible.) There are hints, here and there, that robots.txt might be so universal that it could ripen into something legally significant. Restrictions not expressed in it would be unenforceable; restrictions that are expressed in it would be mandatory. ACAP is an attempt to create that sort of universal regime all at once. Fiat ACAP, fiat lex.
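For reference, the most generous robots.txt possible is just two lines; an empty Disallow field means “nothing is off limits”:

    User-agent: *
    Disallow: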
ACAP, therefore, is two things. First, it’s a dialect for search engines and websites to communicate, so that everyone can Just Get Along (TM). Second, it’s an implicit threat. Play along on our terms or else.
Hence my interest.
I’m realizing that this thing wasn’t written by web people.
Take 2.5.3.1, which allows sites to express time limits for how long a document may be indexed. The time limits are expressed in days. Days. How hard would it have been to add hours and minutes? To use UTC times? To express what time zone a given date refers to? Not hard at all. But no one did, which says to me that the working group wasn’t pushed very hard by people who really design Internet software for a living.
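To make the complaint concrete, the difference looks roughly like this. The field names below are my invention for illustration; only the days-only granularity is the spec’s.

    # Illustrative only; these are not ACAP's actual field names.
    ACAP-allow-index: /news/ time-limit=30                # days: the only granularity 2.5.3.1 offers
    ACAP-allow-index: /news/ until=2008-06-01T00:00:00Z   # an unambiguous UTC timestamp instead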
Then, 2.5.3.2 defines a syntax that “will generally only be interpreted by prior arrangement.” This makes no sense. If a part of your protocol has semantics such that the people using it need to strike a side bargain setting out when to use it, it shouldn’t be in the core syntax. Either work out the issues (here, unspecified “security” concerns), or kick the syntax out to an extension. This kind of indecisive hedging runs against the spirit of Postel’s robustness principle: “Be conservative in what you send; be liberal in what you accept.”
In general, I don’t get the sense that ACAP was drafted with careful thought about which statements should be expressed as permissions and which as restrictions. The spec contains notes using phrases you never want to see in one, such as “it is not clear,” debating whether this or that feature should be a condition attached to an allow, a prohibit, or something else entirely. I like the transparency for a work in progress, but still: if these issues are that unresolved, this is a discussion draft, not an “implementation version.”
The lack of examples in the spec is annoying; the naming is worse. Take the basic usage type “present,” which “enables expression of a general permission or prohibition to present a particular resource in any way.” “Present” is a bad word to use as a verb in this kind of a spec; it’s too easy to confuse with the adjective that means the opposite of “missing.” “Display” might have been better (although they do note the problem of “display”ing nonvisual material); “return” is plausible; even “represent” might be clearer in this context.

Also … what we have here is a failure to generalize. An engine can be told that it can or cannot present the following versions of a resource:
original
currentcopy
oldcopy
snippet
extract
thumbnail
oldsnippet
extract (again)
thumbnail (again, with a different definition)
link
This is such a confused list, I don’t know where to begin. It mixes up age, summarization, and media types. Which of “snippet,” “extract,” and “thumbnail” applies to video content? Baking “old” and “current” into the different versions, rather than just making them additional modifiers, causes needless confusion. It’s actually reasonably easy to reformat this list along orthogonal axes (a sketch follows); I hope that ACAP does just that before anyone wastes too much effort trying to implement this chaotic mishmash.
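Here is how those axes might factor apart. The field names and modifiers below are my invention, not the spec’s; this is a sketch of the generalization, not a proposal for actual syntax.

    # Hypothetical syntax: age and form as orthogonal modifiers on one usage type.
    # age:  current | old
    # form: whole | snippet | extract | thumbnail | link
    ACAP-allow-present: /archives/ form=snippet age=old
    ACAP-disallow-present: /archives/ form=thumbnail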
The max-length modifier is defined only for textual content. Max-size might have made sense for thumbnails, y’know?

In my world, if your spec repeatedly contains the words, “This … feature … is not yet fully ready for implementation,” you are not permitted to claim, “The ACAP Pilot Project has created the v1.0 of the Automated Content Access Protocol in time, on budget and with unprecedented collaboration and support from the industry.”
“Although in theory the usage type other [i.e., the default fall-through case] could be used in a permission field, there is no known use case for doing so.” Garbage. If I used ACAP on my site, I’d use ACAP-allow-other to signal approval of whatever innovative things search engines might come up with to make my site easier to use and more accessible. But leave it to a publisher-funded group to believe that there’s no “use case” for embracing the unknown or for voluntarily sharing without restriction.
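For what it’s worth, the record I have in mind would look something like this. I haven’t checked it against the spec’s grammar; the grouping line in particular is an assumption borrowed from robots.txt.

    # Sketch only; ACAP may use a different group header than User-agent.
    User-agent: *
    ACAP-allow-other: /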
I admire the working group’s attempts to specify ACAP in a way that will minimize conflicts with crawlers that only understand robots.txt-speak. Personally, I would recommend that anyone writing an ACAP file use ACAP-permissions-reference in their robots.txt to send all crawlers that speak ACAP to an external file of pure ACAP. That way, each kind of crawler reads only the file format it speaks natively, minimizing conflicts.
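Concretely, the robots.txt side of that arrangement might read like this (the URL is made up; ACAP-permissions-reference is the field discussed above):

    User-agent: *
    Disallow: /private/
    ACAP-permissions-reference: http://www.example.com/acap.txt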
All in all, it’s an interesting start. I’m concerned that the publishers will soon argue that failure to respect every last detail expressed in an ACAP file will constitute automatic copyright infringement, breach of contract, trespass to computer systems, a violation of the Computer Fraud and Abuse Act (and related state statutes), trespass vi et armis, highway robbery, land-piracy, misappropriation, alienation of affection, and/or manslaughter. But still, that argument isn’t ACAP’s fault, and as a language for clearing the channels of publisher-search engine communication, it has substantial promise.