I'm interested in exposing a direct REST interface to collections of JSON documents (think CouchDB or Persevere). The problem I'm running into is how to handle the GET operation on the collection root if the collection is large.

As an example, pretend I'm exposing StackOverflow's Questions table, where each row is exposed as a document (not that there necessarily is such a table; it's just a concrete example of a sizable collection of 'documents'). The collection would be made available at /db/questions with the usual CRUD API in play: GET /db/questions/XXX, PUT /db/questions/XXX, POST /db/questions. The standard way to get the entire collection is to GET /db/questions, but if that naively dumps each row as a JSON object, you'll get a rather sizable download and a lot of work on the part of the server.

The solution is, of course, paging. Dojo has solved this problem in its JsonRestStore via a clever RFC 2616-compliant extension: using the Range header with a custom range unit, items. The result is a 206 Partial Content that returns only the requested range. The advantage of this approach over a query parameter is that it leaves the query string for...queries (e.g. GET /db/questions/?score>200 or somesuch, and yes, that'd be encoded %3E).
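To make the exchange concrete, here is a minimal Python sketch of the server side. It assumes the request carries `Range: items=0-24` and that the reply uses `Content-Range: items 0-24/100`, which is the header shape I understand Dojo to use. The function name and the in-memory list standing in for the collection are hypothetical.

```python
import re

# Matches a Range header value such as "items=0-24" (custom "items" unit).
_ITEMS_RANGE = re.compile(r"^items=(\d+)-(\d+)$")

def slice_collection(docs, range_header):
    """Return (status, headers, body) for a GET on the collection root."""
    total = len(docs)
    match = _ITEMS_RANGE.match(range_header or "")
    if match is None:
        # No Range header, or a unit we don't understand: full response.
        return 200, {}, docs
    first, last = int(match.group(1)), int(match.group(2))
    if first > last:
        # Syntactically invalid range: ignore it and send everything.
        return 200, {}, docs
    if first >= total:
        # Nothing in the collection overlaps the requested range.
        return 416, {"Content-Range": f"items */{total}"}, []
    last = min(last, total - 1)  # clamp to the end of the collection
    headers = {"Content-Range": f"items {first}-{last}/{total}"}
    return 206, headers, docs[first:last + 1]
```

Note that the no-header branch is exactly the case this question is about: the sketch falls back to a full 200, which is what I'd like to avoid for large collections.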

This approach completely covers the behavior I want. The problem is that RFC 2616 specifies that on a 206 response (emphasis mine):

The request MUST have included a Range header field (section 14.35) indicating the desired range, and MAY have included an If-Range header field (section 14.27) to make the request conditional.

This makes sense in the context of the standard usage of the header but is a problem because I'd like the 206 response to be the default to handle naive clients/random people exploring.

I've gone over the RFC in detail looking for a solution but have been unhappy with my solutions and am interested in SO's take on the problem.

Ideas I've had:

  • Return 200 with a Content-Range header! - I don't think that this is wrong, but I'd prefer a more obvious indicator that the response is only Partial Content.
  • Return 400 Range Required - There is not a special 400 response code for required headers, so the default error has to be used and read by hand. This also makes exploration via web browser (or some other client like Resty) more difficult.
  • Use a query parameter - The standard approach, but I'm hoping to allow queries a la Persevere and this cuts into the query namespace.
  • Just return 206! - I think most clients wouldn't freak out, but I'd rather not go against a MUST in the RFC.
  • Extend the spec! Return 266 Partial Content - Behaves exactly like 206 but is in response to a request that MUST NOT contain the Range header. I figure that 266 is high enough that I shouldn't run into collision issues and it makes sense to me but I'm not clear on whether this is considered taboo or not.

I'd think this is a fairly common problem and I'd like to see this done in a sort of de facto fashion so I or someone else isn't reinventing the wheel.

What's the best way to expose a full collection via HTTP when the collection is large?

+7  A: 

My gut feeling is that the HTTP range extensions aren't designed for your use case, and thus you shouldn't try. A partial response implies 206, and 206 must only be sent if the client asked for it.

You may want to consider a different approach, such as the one used in Atom (where the representation by design may be partial, and is returned with a status 200, and potentially paging links). See RFC 4287 and RFC 5005.

Julian Reschke
The Dojo usage is completely within spec. If the server doesn't understand the `items` range unit, it returns a full response. I'm familiar with Atom, but that's not the general solution to REST paging. This isn't a solution for a single case, more of what the general solution should be. Not all documents/collections fit the Atom model and there's no reason to force it unless required.
Karl Guertin
+2  A: 

You might consider using a model something like the Atom Feed Protocol since it has a sane HTTP model of collections and how to manipulate them (where insane means WebDAV).

There's the Atom Publishing Protocol which defines the collection model and REST operations plus you can use RFC 5005 - Feed Paging and Archiving to page through big collections.

Switching from Atom XML to JSON content should not affect the idea.

dajobe
+2  A: 

Edit:

After thinking about it a bit more, I'm inclined to agree that Range headers aren't appropriate for pagination. The logic being: the Range header is intended for the server's response, not the application's. If you served 100 megabytes of results, but the server (or client) could only process 1 megabyte at a time, well, that's what the Range header is for.

I'm also of the opinion that a subset of resources is its own resource (similar to relational algebra), so it deserves representation in the URL.

So basically, I recant my original answer (below) about using a header.


I think you answered your own question, more or less - return 200 or 206 with content-range and optionally use a query parameter. I would sniff the user agent and content type and, depending on those, check for a query parameter. Otherwise, require the range headers.

You essentially have conflicting goals - let people use their browser to explore (which doesn't easily allow custom headers), or force people to use a special client that can set headers (which doesn't let them explore).

You could just provide them with the special client depending on the request - if it looks like a plain browser, send down a small ajax app that renders the page and sets the necessary headers.

Of course, there is also the debate about whether the URL should contain all the necessary state for this sort of thing. Specifying the range using headers can be considered "un-restful" by some.

As an aside, it would be nice if servers could respond with a "Can-Specify: Header1, header2" header, and web browsers would present a UI so users could fill in values, if they desired.

Richard Levasseur
Thanks for the response. I've thought about the topic, but was hoping to get a second opinion. Happen to have a pointer for the header arguments?
Karl Guertin
Here's the only one I have bookmarked (see the discussion in the comments): http://barelyenough.org/blog/2008/05/versioning-rest-web-services/
Another site revolved around Ruby's usage of .json, .xml, .whatever in determining the content type of a request. Some of the examples:
* language - putting it in the URL means sending the link to another country would render it in the wrong language.
* pagination - putting it in the header means you can't link people to what you see.
Richard Levasseur
* content-type: a combination of language and pagination problems - if it's in the URL, what if the client doesn't support that content type (e.g., a .ajax and a .html extension)? Conversely, without that content-type in the URL, you can't ensure the same representation is given. "new ajax site! example.com/cool.ajax" vs "cool article here: example.com/article.ajax#id=123".
Richard Levasseur
IMO, whether it goes in the URL or not depends on what it is. My general rule is, if it would identify a concrete resource (be it a resource in a specific state, selection of resources, or discrete result), it goes in the URL. Search queries, pagination, and restful transactions are good examples of this. If it's something that is needed to transform the abstract representation to a concrete representation, it goes in the header. Auth info and content-type are good examples of this.
Richard Levasseur
+1  A: 

You can detect the Range header, and mimic Dojo if it is present, and mimic Atom if it is not. It seems to me that this neatly divides the use cases. If you are responding to a REST query from your application, you expect it to be formatted with a Range header. If you are responding to a casual browser, then if you return paging links it will let the tool provide an easy way to explore the collection.
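A rough Python sketch of that split, under a few assumptions of mine: the request headers arrive as a plain dict keyed by "Range", the collection is an in-memory list, and the second-page URL in the Link header is a made-up placeholder. The `respond` name is hypothetical too.

```python
import re

def respond(request_headers, docs, page_size=25):
    """Pick a paging style based on whether the client sent a usable Range header."""
    total = len(docs)
    match = re.fullmatch(r"items=(\d+)-(\d+)", request_headers.get("Range", ""))
    if match:
        first, last = int(match.group(1)), int(match.group(2))
        if first <= last and first < total:
            # Range-aware client: answer the requested slice with a 206.
            last = min(last, total - 1)
            headers = {"Content-Range": f"items {first}-{last}/{total}"}
            return 206, headers, docs[first:last + 1]
    # Everyone else (and, for brevity, unusable ranges): first page with a 200.
    headers = {}
    if total > page_size:
        # Hypothetical URL for the second page; the "next" rel name is the one RFC 5005 uses.
        headers["Link"] = '</db/questions?page=2>; rel="next"'
    return 200, headers, docs[:page_size]
```

A real server would probably answer an out-of-bounds range with a 416 rather than falling through to the default page; that branch is left out to keep the sketch short.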

Greg
+2  A: 

If there is more than one page of responses, and you don't want to offer the whole collection at once, does that mean there are multiple choices?

On a request to /db/questions, return 300 Multiple Choices with Link headers that specify how to get to each page as well as a JSON object or HTML page with a list of URLs.

Link: <>; rel="http://paged.collection.example/relation/paged"
Link: <>; rel="http://paged.collection.example/relation/paged"
...

You'd have one Link header for each page of results (an empty string means the current URL, and the URL is the same for each page, just accessed with different ranges), and the relationship is defined as a custom one per the upcoming Link spec. This relationship would explain your custom 266, or your violation of 206. These headers are your machine-readable version, since all of your examples require an understanding client anyway.

(If you stick with the "range" route, I believe your own 2xx return code, as you described it, would be the best behavior here. You're expected to do this for your applications and such ["HTTP status codes are extensible."], and you have good reasons.)

300 Multiple Choices says you SHOULD also provide a body with a way for the user agent to pick. If your client is understanding, it should use the Link headers. If it's a user manually browsing, perhaps an HTML page with links to a special "paged" root resource that can handle rendering that particular page based on the URL? /humanpage/1/db/questions or something hideous like that?


The comments on Richard Levasseur's post remind me of an additional option: the Accept header (section 14.1). Back when the oEmbed spec came out, I wondered why it hadn't been done entirely using HTTP, and wrote up an alternative using them.

Keep the 300 Multiple Choices, the Link headers and the HTML page for an initial naive HTTP GET, but rather than use ranges, have your new paging relationship define the use of the Accept header. Your subsequent HTTP request might look like this:

GET /db/questions HTTP/1.1
Host: paged.collection.example
Accept: application/json;PagingSpec=1.0;page=1

The Accept header allows you to define an acceptable content type (your JSON return), plus extensible parameters for that type (your page number). Riffing on my notes from my oEmbed writeup (can't link to it here, I'll list it in my profile), you could be very explicit and provide a spec/relation version here in case you need to redefine what the page parameter means in the future.
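On the server, pulling the page number back out of that header could look roughly like the Python sketch below. It only handles a single media range (no comma-separated list, no q-values), and the function names are hypothetical; the parameter names `PagingSpec` and `page` are the ones from the request above.

```python
def parse_paged_accept(accept_value):
    """Split one Accept media range into (media_type, params)."""
    media_type, *raw_params = [part.strip() for part in accept_value.split(";")]
    params = {}
    for raw in raw_params:
        name, _, value = raw.partition("=")
        # Parameter names are case-insensitive, so normalise them.
        params[name.strip().lower()] = value.strip().strip('"')
    return media_type.lower(), params

def requested_page(accept_value, default=1):
    """Return the requested page, or the default if the paging spec is absent."""
    _, params = parse_paged_accept(accept_value)
    if params.get("pagingspec") != "1.0":
        return default
    try:
        return int(params.get("page", default))
    except ValueError:
        return default
```

Checking the spec version before trusting `page` is what lets you redefine the parameter later without breaking old clients.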

Vitorio
+1 link headers, but I'd also recommend the common first, prev, next, last rels, as well as RFC5005's prev-archive, next-archive, and current.
Joseph Holsten
+3  A: 

I think the real problem here is that there is nothing in the spec that tells us how to do automatic redirects when faced with 413 - Request Entity Too Large.

I was struggling with this same problem recently and I looked for inspiration in the RESTful Web Services book. Personally I don't think 206 is appropriate due to the header requirement. My thoughts also led me to 300, but I thought that was more for different mime-types, so I looked up what Richardson and Ruby had to say on the subject in Appendix B, page 377. They suggest that the server just pick the preferred representation and send it back with a 200, basically ignoring the notion that it should be a 300.

That also jibes with the notion of links to next resources that we have from atom. The solution I implemented was to add "next" and "previous" keys to the json map I was sending back and be done with it.
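Roughly, in Python, that looks like the sketch below. The "next" and "previous" keys are the ones I described; the "items" key, the function name, and the offset/limit URL scheme are stand-ins for illustration, not necessarily what you'd want in your API.

```python
import json

def page_document(docs, offset, limit, base="/db/questions"):
    """Wrap one slice of the collection with "next"/"previous" links."""
    page = {"items": docs[offset:offset + limit]}
    if offset + limit < len(docs):
        # More documents follow this slice.
        page["next"] = f"{base}?offset={offset + limit}&limit={limit}"
    if offset > 0:
        # Clamp at zero so the first page is never addressed with a negative offset.
        page["previous"] = f"{base}?offset={max(offset - limit, 0)}&limit={limit}"
    return json.dumps(page)
```

The response still goes out with a plain 200; the client discovers that it is partial by the presence of the link keys.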

Later on I started thinking maybe the thing to do is send a 307 - Temporary Redirect to a link that would be something like /db/questions/1,25 - that leaves the original URI as the canonical resource name, but it gets you to an appropriately named subordinate resource. This is behavior I'd like to see out of a 413, but 307 seems a good compromise. Haven't actually tried this in code yet though. What would be even better is for the redirect to redirect to a URL containing the actual IDs of the most recently asked questions. For example if each question has an integer ID, and there are 100 questions in the system and you want to show the ten most recent, requests to /db/questions should be 307'd to /db/questions/100,91
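As I said, I haven't tried this in real code, but the redirect computation itself would be something like the following Python sketch. It assumes integer IDs that run contiguously from 1, as in the example above; the function name is hypothetical.

```python
def redirect_for_root(latest_id, page_size=10, base="/db/questions"):
    """Return (status, headers) redirecting the collection root to its newest page."""
    # Oldest ID on the page, never going below the first ID in the system.
    oldest_id = max(latest_id - page_size + 1, 1)
    return 307, {"Location": f"{base}/{latest_id},{oldest_id}"}
```

With 100 questions and a page size of ten, `redirect_for_root(100)` yields a Location of /db/questions/100,91.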

This is a very good question, thanks for asking it. You confirmed for me that I'm not nuts for having spent days thinking about it.

stinkymatt
A: 

It seems to me that the best way to do this is to include the range as query parameters, e.g., GET /db/questions/?date>mindate&date<maxdate. Upon a GET to /db/questions/ with no query parameters, return 303 with Location: /db/questions/?query-parameters-to-retrieve-the-default-page. Then provide a different URL from which whoever is consuming your API can get statistics about the collection (e.g., what query parameters to use if s/he wants the entire collection).

Dathan
+2  A: 

I don't really agree with some of you guys. I've been working for weeks on this feature for my REST service. What I finally did is really simple.

Client MUST include an "Accept-Ranges" header to avoid receiving a 413 error (Request Entity Too Large).

Server sends a 206 (Partial Content), with the Content-Range header with a custom range unit specifying which part of the resource has been sent, and an ETag header to identify the current version of the resource.

To request a specific part of the resource, the client MUST use the "Accept-Ranges", "Range" and "If-Modified-Since" headers. Using the date in the "If-Modified-Since" header, the server can verify that the resource hasn't changed since the last request, in which case the request is processed normally. If it has changed, a 412 (Precondition Failed) error is sent: the rank of each item might have been modified, so the whole paging process is corrupt.

I use ETag/If-Match in tandem with Last-Modified/If-Modified-Since to optimize cache. Browsers and proxies might rely on one or both of them for their caching algorithms.

I think that URL should be clean unless it's to include a search/filter query.

MetalUpYourAss