I'm writing a web app similar to wtfimages.com in that one visitor should never (or rarely) see the same thing twice, but different visitors can see the same thing. Ideally, this would span visits, so that when Bob comes back tomorrow he doesn't see today's things again either.

Three first guesses:

  • have enough unique things that it's unlikely any user will draw enough items to repeat
  • actually track each user somehow and log what he has seen
  • have client-side Javascript request things by id according to a pseudorandom sequence seeded with something unique to the visitor and session (e.g., IP and time)
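
A rough sketch of that third guess in Python (which is what I'm trying to learn). The seed format and the id list are made up; the point is just that the same seed regenerates the same sequence, so nothing has to be stored server-side:

```python
import random

def visitor_order(seed_material: str, item_ids: list) -> list:
    """Derive a repeatable per-visitor ordering from a seed string
    (e.g. IP plus first-visit date). The same seed always produces
    the same permutation, so the sequence never needs to be stored."""
    rng = random.Random(seed_material)  # deterministic PRNG, not cryptographic
    order = list(item_ids)
    rng.shuffle(order)
    return order

# Hypothetical visitor seed: regenerating the order gives the same result.
ids = list(range(10))
first = visitor_order("203.0.113.7|2009-11-20", ids)
again = visitor_order("203.0.113.7|2009-11-20", ids)
# first == again, and both are permutations of ids
```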

Edit: So the question is, which of these three is the best solution? Is there a better one?

Note: I suspect this question is the web 2.0 equivalent of "how do I implement strcpy?", where everybody worth his salt knows K&R's idiomatic while(*s++ = *t++) ; solution. If that's the case, please point me to the web 2.0 K&R, because this specific question is immaterial. I just wanted a "join the 21st century" project to learn CGI scripting with Python and AJAX with jQuery.

+2  A: 

The simplest implementation I can think of would be to make a circular linked list, and then start individual users at random offsets in the linked list. You are guaranteed that they will see every image there is to see before they will see any image twice.

Technically, it only needs to be a linked list in a conceptual sense. For example, you could just use the database identifiers of the various items and wrap around once you've hit the last one.
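
In Python (since the asker mentioned learning it), the conceptual circular list might look like this; the ids are made up stand-ins for database identifiers:

```python
def circular_feed(item_ids, start_offset):
    """Yield ids forever, wrapping around with modular arithmetic.
    Each user gets a random start_offset; everyone still sees every
    item exactly once before any item repeats."""
    n = len(item_ids)
    i = start_offset % n
    while True:
        yield item_ids[i]
        i = (i + 1) % n

ids = [10, 11, 12, 13]                      # stand-in database ids
feed = circular_feed(ids, start_offset=2)
first_cycle = [next(feed) for _ in range(4)]
# first_cycle == [12, 13, 10, 11]: every image exactly once before any repeat
```

In practice you'd persist each user's offset and position (cookie, session, or database row) rather than keeping a generator alive between requests.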


There are complexity problems with other solutions. For example, if you want it to be a different order for each person, that requires permuting the elements in some way. But then you have to store that permutation, so as to guarantee that people see things in different orders. That's going to take up a lot of space. It will also require you to update everybody's permutations if you add or remove an image to the list of things to see, which is yet more work.

A compromise solution that still allows you to guarantee a person sees every image before they see any image twice while still varying things among people might be something like this:

  • Using some hash function H (say, MD5), take the hash of each image, and store the image with a filename equal to the digest (e.g. 194db8c5[...].jpg).

  • Decide on a number N. This will be the number of different paths that a randomly selected person could take to traverse all the images. For example, if you pick N = 10, each person will take one of 10 possible distinct journeys through the images. Don't pick an N larger than the length of the digest filenames produced by H (32 hex characters for MD5; 40 for SHA-1), since there are only that many distinct rotations.

  • Make N different permutations of the image list, with the ith such permutation being generated by rotating the characters in each file name i characters to the left, and then sorting all the entries. (For example, a file originally named abcdef with i == 4 will become efabcd. Now sort all the files that have been transformed in this way, and you have a distinct list.)

  • Randomly assign to each user a number r from 0 .. N - 1 inclusive. They now see the images in the ordering specified by r.
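
The rotate-and-sort steps above can be sketched like this; the filenames here are hypothetical MD5 hex digests:

```python
import hashlib

def rotated_order(filenames, r):
    """Permutation r: sort the names by their left-rotation by r
    characters, so each r yields a different (but fixed) traversal."""
    def rotation_key(name):
        return name[r:] + name[:r]  # e.g. 'abcdef' with r=4 -> 'efabcd'
    return sorted(filenames, key=rotation_key)

# Hypothetical image files named by their MD5 hex digests:
names = [hashlib.md5(s.encode()).hexdigest() for s in ("a", "b", "c")]
for r in range(3):
    order = rotated_order(names, r)   # user assigned r sees this order
```

With r = 0 the key is the name itself, so that path is just the plain sorted listing; other values of r reshuffle it deterministically.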

Ultimately, this seems like a lot of work. I'd say just suck it up and make it random, accept that people will occasionally see the same image again, and move on.

John Feminella
This will probably be the quick-and-dirty development implementation, but I'd like to move later to the cookie or localstorage strategies because I'd prefer not to have people see things in the same order.
Wang
+1  A: 

Personally I would just store a cookie on the user's machine which holds the IDs of everything he's seen. That way you can keep the 'randomness' and not have to show the items in sequential order as John Feminella's otherwise great solution suggests.

Applying the cookie data in an SQL query would also be trivial: say you have comma-separated IDs in the cookie, you can just do this (in PHP):

"SELECT image FROM images WHERE id NOT IN(".$_COOKIE['myData'].") ORDER BY RAND() LIMIT 1"

Note that this is just a simple example; you should of course escape the cookie data properly, and there might be more efficient ways to select a random entry from a table.

Using a cookie also makes it possible to pick up where the user left off the previous time. And cookie size probably won't be an issue: you can hold a lot of IDs in 4 KB, which is (usually) the maximum cookie size.

EDIT

If your cookie data looks like this:

$_COOKIE['myData'] == '1,6,19,200,70,16';

You can safely use that data in a SQL query with:

$ids = array_map('mysql_real_escape_string', explode(',', $_COOKIE['myData']));
$query = "SELECT image FROM images WHERE id NOT IN('".implode("', '", $ids)."') ORDER BY RAND() LIMIT 1";

This splits the ID string into individual IDs, runs mysql_real_escape_string on each of them, and then implodes them with quotes, so that the query becomes:

$query == "SELECT image FROM images WHERE id NOT IN('1', '6', '19', '200', '70', '16') ORDER BY RAND() LIMIT 1"

So $_COOKIE[] values are user input like any other, and you must take the same precautions with them as with any other external data.
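
Since the asker is learning Python: the same idea there would use query placeholders instead of string-building, which sidesteps the escaping problem entirely. A minimal sketch with sqlite3 and made-up rows (a real app would use its own driver and schema):

```python
import sqlite3

def unseen_random_image(conn, seen_ids):
    """Pick a random image the visitor hasn't seen, using bound
    parameters so the cookie data can't inject SQL."""
    if seen_ids:
        placeholders = ",".join("?" * len(seen_ids))
        sql = ("SELECT id, image FROM images WHERE id NOT IN (%s) "
               "ORDER BY RANDOM() LIMIT 1" % placeholders)
        return conn.execute(sql, list(seen_ids)).fetchone()
    return conn.execute(
        "SELECT id, image FROM images ORDER BY RANDOM() LIMIT 1").fetchone()

# Tiny in-memory demo:
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE images (id INTEGER PRIMARY KEY, image TEXT)")
conn.executemany("INSERT INTO images VALUES (?, ?)",
                 [(i, "img%d.jpg" % i) for i in range(1, 7)])
row = unseen_random_image(conn, [1, 6, 2, 5])
# row[0] is 3 or 4: the only ids not yet seen
```

Note the empty-list case is handled separately, because `NOT IN ()` (or `NOT IN (NULL)`) would return no rows at all.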

Tatu Ulmanen
... and cookies are sent through headers on **every** HTTP transaction... hello overhead.
jldupont
also note the potential SQL injection attack vector in the solution.
jldupont
@jldupont, that's why I said 'should of course escape the cookie data properly'. And given all the other data that moves through **every** HTTP transaction, I don't think one cookie will make much difference?
Tatu Ulmanen
Since I've never worked with cookies before, could you recommend any good guide to "escaping the cookie data properly"? I understand what an SQL injection is; I've never had to defend against it.
Wang
@jldupont, you are right that in some cases that overhead should be taken into account, but we're dealing with a site heavy on image content, so that one cookie is not significant in any way. @Wang, I edited my answer to show how to escape the data.
Tatu Ulmanen
@Tatu: still, if ETag/Last-Modified headers are supported on the domain server, the images can be cached whilst headers in requests are always there... even in HEAD methods.
jldupont
A: 

You have two classes of solutions:

  1. state-less
  2. state-full

You need to pick one: (#1) of course gives no guarantee (i.e. the probability of showing the same image to a user varies), whilst (#2) lets you make guarantees (depending on the implementation, of course).

Here is another suggestion you might want to consider:

Maintain state on the client side through HTML5 localStorage (when available): the value of this option will only grow as browser support for HTML5 increases.

jldupont
Is http://dev.w3.org/html5/webstorage/ what you mean? And does http://stackoverflow.com/questions/1194784/which-browsers-support-html5-offline-storage mean it's widely available?
Wang
@Wang: more precisely, HTML5 localStorage (http://dev.w3.org/html5/webstorage/#the-localstorage-attribute), available in Chrome/Safari/FF. Note that I didn't say this was widely available **at the moment**, but HTML5 adoption is rising and will continue to rise.
jldupont
@Wang, as you saw from the link, there is little consistency in how browsers implement local storage, so utilizing it would require copious amounts of code to work properly - and you still can't support all users (e.g. mobile users), which you can with cookies.
Tatu Ulmanen