The Laboratorium : Automated Content Access Problems

I’m reading the new Automated Content Access Protocol. It’s a proposed protocol by which content providers will tell Google how to do its job. I’m optimistic about the idea, but this spec is … strange.

The problem ACAP is trying to solve is that some people put stuff online that they don’t want search engines to pay attention to. There are a lot of good reasons to do this:

- Telling search engines not to follow links in your blog comments is one way of keeping link-spamming scoundrels at bay.
- Some search engine robots are bad houseguests; they put their muddy boots up on the couches and issue way too many requests at a time.
- Like bread, some content gets stale, and you want visitors to the site to look at a fresher version.
- The content is semi-private; not worth password protecting, but not really meant for everyone to trawl through, either.
- Robots are often stupid, and could use unambiguous signs telling them to come in through the door, not the skylight.

There are also some more nakedly economic reasons, not all of which are bad:

- You want people to know about your stuff so that they come to your site and look at your ads. People who look at cached copies on the search engine side don’t see the ads.
- You provide recent content (like news) for free. After a while, the content goes behind a paywall. Once it does, search engines keep clear!
- Search engines make truckloads of money, and as a hardworking content provider, you’d like a cut of the action.

Heretofore, the main protocols for telling search engines what to do and what not to do (robots.txt and the Robots META HTML tag) haven’t been particularly fine-grained. The search engine companies have been allowing providers to create quasi-proprietary “sitemaps,” but those tend to be optimized towards telling the search engine in great detail what it should look for, rather than telling it where not to look. The result is that content providers, in general, have faced a tradeoff between opting out of the search economy entirely (a sure ticket to oblivion in this day and age) or opting in on terms they’re dissatisfied with. Some, like Copiepresse, have thrown conniption fits about it, but others have actually done something about the weather and come up with ACAP.

Robots.txt was a fascinating example of decentralized decision-making; no one requires sites to use it or search engines to respect it, but everyone does, pretty much as a matter of course. (Mine, for example, is as simple and generous as possible.) There are hints, here and there, that robots.txt might be so universal that it could ripen into something legally significant. Restrictions not expressed in it would be unenforceable; restrictions that are expressed in it would be mandatory. ACAP is an attempt to create that sort of universal regime all at once. Fiat ACAP, fiat lex.

ACAP, therefore, is two things. First, it’s a dialect for search engines and websites to communicate, so that everyone can Just Get Along (TM). Second, it’s an implicit threat. Play along on our terms or else.

Hence my interest.

I’m realizing that this thing wasn’t written by web people.

Take 2.5.3.1, which allows sites to express time limits for how long a document may be indexed. The time limits are expressed in days. Days. How hard would it have been to add hours and minutes? To use UTC times? To express what time zone a given date refers to? Not hard at all. But no one did, which says to me that the working group wasn’t pushed very hard by people who really design Internet software for a living.
Then, 2.5.3.2 defines a syntax that “will generally only be interpreted by prior arrangement.” This makes no sense. If a part of your protocol has semantics such that the people using the protocol need to strike a side bargain that sets out when to use it, it shouldn’t be in the core syntax. Either work out the issues (here, unspecified “security” concerns), or kick the syntax out to an extension. This kind of indecisive hedging violates Postel’s robustness principle to “Be liberal in what you accept; be conservative in what you send.”
In general, I don’t get the sense that ACAP was carefully drafted to think through the issues of which statements should be expressed as permissions and which as restrictions. There are notes in the spec using phrases you never want to see in a spec, such as “it is not clear,” debating whether X or Y feature should be a condition attached to an allow, a prohibit, or something else entirely. I like the transparency for a work in progress, but still, if these issues are that unresolved, this is a discussion draft, not an “implementation version.”
The lack of examples in the spec is annoying; the naming is worse. Take:

“The basic usage type present enables expression of a general permission or prohibition to present a particular resource in any way.” “Present” is a bad word to use as a verb in this kind of a spec; it’s too easy to confuse with the adjective that means “not missing.” “Display” might have been better (although they do note the problem of “display”ing nonvisual material); “return” is plausible; even “represent” might be clearer in this context.
Also … What we have here is a failure to generalize. An engine can be told that it can or cannot present the following versions of a resource:
- original
- currentcopy
- oldcopy
- snippet
- extract
- thumbnail
- oldsnippet
- extract (again)
- thumbnail (again, with a different definition)
- link
This is such a confused list, I don’t know where to begin. The list mixes up age, summarization, and media types. Which of snippet, extract, and thumbnail applies to video content? Baking old and current into the different versions, rather than just making them additional modifiers, causes needless confusion. It’s actually reasonably easy to reformat this list along orthogonal axes; I hope that ACAP does just that before anyone wastes too much effort trying to implement this chaotic mishmash.
max-length is defined only for textual content. Max-size might have made sense for thumbnails, y’know?
In my world, if your spec repeatedly contains the words, “This … feature … is not yet fully ready for implementation,” you are not permitted to claim, “The ACAP Pilot Project has created the v1.0 of the Automated Content Access Protocol in time, on budget and with unprecedented collaboration and support from the industry.”
“Although in theory the usage type other [i.e. the default fall-through case] could be used in a permission field, there is no known use case for doing so.” Garbage. If I used ACAP on my site, I’d use ACAP-allow-other to signal an approval of whatever innovative things that search engines might come up with to make my site easier to use and more accessible. But leave it to a publisher-funded group to believe that there’s no “use case” for embracing the unknown or for voluntarily sharing without restriction.
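
To make the “orthogonal axes” complaint about the version list concrete, here is a rough Python sketch of what such a reformatting might look like. Everything in it is my own assumption, not ACAP syntax: the `Age` and `Form` axes, the `Version` and `Rule` classes, the `may_present` helper, and above all the mapping from the spec’s flat names onto the two axes, which is a guess at what those names are meant to cover. “extract” is left out only to keep the sketch short.

```python
from dataclasses import dataclass
from enum import Enum


class Age(Enum):
    LIVE = "live"        # served from the publisher's site at display time
    CURRENT = "current"  # the engine's most recent stored copy
    OLD = "old"          # a stored copy that has since been superseded


class Form(Enum):
    FULL = "full"            # the whole resource
    SNIPPET = "snippet"      # a short excerpt
    THUMBNAIL = "thumbnail"  # a reduced-size preview
    LINK = "link"            # a bare hyperlink


@dataclass(frozen=True)
class Version:
    age: Age
    form: Form


# Assumed mapping of the spec's flat names onto the two axes (illustrative only).
FLAT_NAMES = {
    "original": Version(Age.LIVE, Form.FULL),
    "currentcopy": Version(Age.CURRENT, Form.FULL),
    "oldcopy": Version(Age.OLD, Form.FULL),
    "snippet": Version(Age.CURRENT, Form.SNIPPET),
    "oldsnippet": Version(Age.OLD, Form.SNIPPET),
    "thumbnail": Version(Age.CURRENT, Form.THUMBNAIL),
    "link": Version(Age.LIVE, Form.LINK),
}


@dataclass(frozen=True)
class Rule:
    allow: bool
    ages: frozenset   # ages this rule covers
    forms: frozenset  # forms this rule covers

    def covers(self, version: Version) -> bool:
        return version.age in self.ages and version.form in self.forms


def may_present(rules, version, default=True):
    """First covering rule wins; otherwise fall back to the default."""
    for rule in rules:
        if rule.covers(version):
            return rule.allow
    return default


# One rule says "nothing stale, in any form" without an old* name per form.
no_stale = [Rule(allow=False, ages=frozenset({Age.OLD}), forms=frozenset(Form))]
print(may_present(no_stale, FLAT_NAMES["oldsnippet"]))  # False
print(may_present(no_stale, FLAT_NAMES["snippet"]))     # True
```

The point of the two axes is that adding a new form (say, an audio preview) would not require inventing a matching “old” name for it; the age modifier applies to every form automatically.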

I admire the working group’s attempts to specify ACAP in a way that will minimize conflicts with crawlers that only understand robots.txt-speak. Personally, I would recommend that anyone writing an ACAP file use ACAP-permissions-reference in their robots.txt to send all crawlers that speak ACAP to an external file that consists of pure ACAP. That way, each kind of crawler reads only the file format it speaks natively, minimizing conflicts.

All in all, it’s an interesting start. I’m concerned that the publishers will soon argue that failure to respect every last detail expressed in an ACAP file will constitute automatic copyright infringement, breach of contract, trespass to computer systems, a violation of the Computer Fraud and Abuse Act (and related state statutes), trespass vi et armis, highway robbery, land-piracy, misappropriation, alienation of affection, and/or manslaughter. But still, that argument isn’t ACAP’s fault, and as a language for clearing the channels of publisher-search engine communication, it has substantial promise.

December 7, 2007 at 2:05 PM

Francis Cave

Thanks for a thoughtful reading of the ACAP proposals. If I may attempt to paraphrase your conclusion, it is that our proposals contain some interesting stuff, but in your view they are somewhat quirky (my interpretation of your word “strange”) and are not in a form that you would consider ready for implementation. As the person primarily responsible for drafting these proposals, I’m obviously keen to understand how we can make them more ready for implementation, and your comments raise some important issues.

It’s unfortunate that the fact that we are calling these proposals “ACAP Version 1.0” is in some quarters being interpreted as some kind of a “take it or leave it” ultimatum. Nothing could be further from the truth. We recognise that these are first public proposals for how publishers could more clearly communicate policies for access and use of what they choose to publish online, and no doubt there are many ways in which these proposals could be improved.

Regarding the overall “strangeness” of these proposals, I believe that this derives in large measure from the fact that the conversation that we have tried to have with search engines during their preparation has been lop-sided. Because the major search engines have not been able, for whatever reason, to engage formally in the ACAP project, much of the input to our work has been from publishers, and this is undoubtedly reflected in many ways - large and small - within the proposals. We would have liked more input from potential implementors on the receiving side, and this has obviously had an impact upon the style and substance of our proposals.

Our original intention was not to use REP at all. Some in the publishing community took - and perhaps still take - the view that REP is an inappropriate technology for communicating policies in terms that make sense to content owners. REP is essentially permissive by default, whereas the basis on which publishers generally license use of content reflects the law, which is that permitted uses are specified by the publisher. Our eventual decision to propose extensions to REP was based upon the blindingly obvious fact that REP is the established way for content owners to communicate routinely with crawler operators, and it will be far easier for crawler operators to implement extensions to what they are already able to interpret in REP than to propose an entirely new protocol.

Since we decided to propose to use the REP methodology to deliver access and usage permissions, most of the discussion has been with publishers about what they want to be able to communicate and, arguably, too little - but by no means none - of the discussion has been with implementors about how best to express this in technical terms. That is not ideal, and has resulted in a set of proposals that quite understandably make more sense to those involved in their development than to those who weren’t involved. ACAP has always been a project open to implementors on the receiving side as well as publishers, and our invitation for them to participate remains open. We would greatly welcome their increased input.

Regarding the detailed points you make about the content of the specification, some of these are down to decisions that I personally took as to what should be in the public version of our proposals and what to leave out. I had to try to strike a balance between at one extreme issuing a set of proposals that only contain stuff that we are entirely confident about and at the other extreme giving a full picture of all the features that we’d like in there but which haven’t been fully specified and tested yet. I obviously got the balance wrong from your point of view, so sorry about that. But here are some detailed responses:

Dates, times and durations - Yes, I’m sure that we have to be more precise about how these should be expressed and interpreted. I think just adopting standard ISO formats for date-times isn’t a total solution. More input from prospective recipients would help us to ensure that what is expressed clearly and precisely also has a clear and precise interpretation in terms of the intended behaviour.
Interpretations by prior arrangement - I think this is a matter of opinion. I believe that we have to define how to express things that publishers wish to say but which search engines, for understandable reasons, won’t in the general case be willing to act upon, but may be prepared to in specific cases. There are several examples in our proposals of forms of expression that search engines, unless they make a special arrangement with the content owner, would be bound to treat as “cloaking”. Maybe implementors on the receiving side would like us to divide our proposals more clearly into a core set of features that don’t involve such issues and a supplementary set that might. From a publisher perspective a number of what you might see as “non-core” features are quite fundamental to what they need to be able to communicate, so at this stage I don’t think it would be helpful to create such a separation.
“Work in progress” indications in the specification - I think this all boils down to a confusion as to whether or not our proposals are supposed to be a finished specification. They’re not, and I freely admit that the “Version 1.0” label, which indicated that this is the output of our pilot phase, has probably not helped us to make that fact clear, especially among web developers who are used to “Version 1.0” being a much more complete and polished thing. I regret that confusion.
Lack of examples - Yes, you’re right, we need to include more examples.
Used resource types - You’re right that these combine a number of things, but in the end what the publisher wants to say is whether or not you can “present” (you may believe this a bad choice, but oh! what interminable debate we’ve already had about the right term to use!) something, and these are an attempt to list the things that get presented in search results: snippets, thumbnails, links, current and past copies that have been stored (“preserved”) in the search engine cache, and sometimes the original page retrieved in real-time from the content owner’s site. By the way, we’ve dropped “extract” from the current list in the corrected set of proposals that are now available from the ACAP website (the second “extract” should have been “oldextract” - a typo - but that’s gone too now). If you exclude “extract”, do you still find the list confusing? Yes, we fully understand that these things get produced/retrieved in a variety of different ways for presentation to the end-user, and in anything other than REP I’d propose a very different way of specifying them. We’ve tried to keep the syntax simple. You can of course generalize and just use “present” without the specialization for a particular used resource type, in which case the permission/prohibition applies to all forms in which a resource might be presented.
Max-length, max-size - One of the items on our immediate “to do” list is to review what needs to be added to support use cases involving images and other non-text resources. We have so far deliberately focussed on permissions relating primarily to text resources.
Version 1.0 - Yes, as I’ve already agreed, this means different things in different worlds, and I regret any confusion that this has caused about the status of the proposals that we have now made public.
Other - This clearly depends upon your point of view. If you’re a publisher and you have a contractual obligation to your authors and other contributors not to allow their content to be used for purposes other than those that are agreed to be permitted, you need to be able to communicate a blanket prohibition. That’s the only sensible application of this usage type that we’ve identified. You don’t need to express a blanket permission in REP, because REP is fundamentally permissive by default, so anything you don’t explicitly prohibit can be assumed to be permitted. On that basis, if you say nothing, that will convey exactly the same as “ACAP-allow-other”, so it would be a redundant expression. By the way, I fundamentally disagree with you about the advisability of letting large corporations do what they like with your data, but that’s another matter.
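
The redundancy argument in the “Other” point can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the `permitted` function, the plain dictionary of rules, and the usage name `"some-new-use"` are hypothetical and are not ACAP or REP syntax. The sketch compares a policy that says nothing with one that explicitly allows “other”, first under a permissive default and then under a restrictive one.

```python
from typing import Mapping


def permitted(usage: str, rules: Mapping[str, bool], default: bool) -> bool:
    """Explicit rule for `usage` if present, else the "other" rule, else `default`."""
    if usage in rules:
        return rules[usage]
    if "other" in rules:
        return rules["other"]
    return default


silent = {}                    # the publisher says nothing at all
allow_other = {"other": True}  # the publisher explicitly allows "other"

# Permissive default: the two policies are indistinguishable.
print(permitted("some-new-use", silent, default=True))        # True
print(permitted("some-new-use", allow_other, default=True))   # True

# Restrictive default: only the explicit permission lets the use through.
print(permitted("some-new-use", silent, default=False))       # False
print(permitted("some-new-use", allow_other, default=False))  # True
```

Under the permissive default the explicit permission adds nothing, which is the sense in which it is redundant; the two policies come apart only if the default is read restrictively.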

Francis Cave, ACAP Technical Project Manager

March 16, 2008 at 4:58 AM

marko

Francis Cave says, “It’s unfortunate that the fact that we are calling these proposals ‘ACAP Version 1.0’ is in some quarters being interpreted as some kind of a ‘take it or leave it’ ultimatum. Nothing could be further from the truth. We recognise that these are first public proposal….”

But the ACAP Web site is urging publishers to modify their robots.txt files using this version 1.0. If the protocol is as buggy as James is saying, shouldn’t you discourage its use just yet?