There is a bug in Firefox (even in the new betas and in Minefield releases) which prevents certain files from being cached, because of the algorithm for creating a key in its cache hash. Here is a link to the source code of the function.

I want to ensure that all of my site's files can be cached. However, I do not understand why the hashing function fails to create unique keys for distinct URLs. I am hoping someone can describe this malfunction in pseudo-code or Java.

It would be good to create a utility for developers to ensure unique URLs until this bug is fixed.


EDIT: There have been some very helpful answers; however, I need more step-by-step help to create a utility to check for these cache mix-ups. It would be great to get some Java code which can reproduce the keys that Firefox is creating. Hence the bounty on this question.


EDIT 2: Here is a Java port (written using Processing). Note the tests at the bottom. My first attempt produced wrong values for some inputs because Java's >> is a signed (arithmetic) shift; the rotate function below uses the unsigned shift >>> instead, and all of the tests now match the expected hashes.

//
// the bad collision function
// http://mxr.mozilla.org/mozilla/source/netwerk/cache/src/nsDiskCacheDevice.cpp#240
//

//248 PLDHashNumber
//249 nsDiskCache::Hash(const char * key)
//250 {
//251     PLDHashNumber h = 0;
//252     for (const PRUint8* s = (PRUint8*) key; *s != '\0'; ++s)
//253         h = PR_ROTATE_LEFT32(h, 4) ^ *s;
//254     return (h == 0 ? ULONG_MAX : h);
//255 }

//
//  a java port...
//

String getHash( String url )
{

//get the char array for the url string (these URLs are ASCII,
//so the chars correspond to the PRUint8 bytes the C code reads)
char[] cs = url.toCharArray();

int h = 0;

//for (const PRUint8* s = (PRUint8*) key; *s != '\0'; ++s)
for ( int i=0; i < cs.length; i++ )
{  h = PR_ROTATE_LEFT32(h, 4) ^ cs[i];
}

//the examples above print hex, so format the result the same way
//(matching ints would be fine too; note Processing's hex() is uppercase)
String hexVal = hex( h );
return hexVal;
}

//
// how to PR_ROTATE_LEFT32
//

//110 /*
//111 ** Macros for rotate left and right. The argument 'a' must be an unsigned
//112 ** 32-bit integer type such as PRUint32.
//113 **
//114 ** There is no rotate operation in the C Language, so the construct
//115 ** (a << 4) | (a >> 28) is frequently used instead. Most compilers convert
//116 ** this to a rotate instruction, but MSVC doesn't without a little help.
//117 ** To get MSVC to generate a rotate instruction, we have to use the _rotl
//118 ** or _rotr intrinsic and use a pragma to make it inline.
//119 **
//120 ** Note: MSVC in VS2005 will do an inline rotate instruction on the above
//121 ** construct.
//122 */
//...
//128 #define PR_ROTATE_LEFT32(a, bits) _rotl(a, bits)


//rotate a 32-bit int left by 'bits' bits. Java has no unsigned types,
//so the right shift must be the unsigned '>>>'; a signed '>>' would
//sign-extend and corrupt the result whenever 'a' goes negative.
int PR_ROTATE_LEFT32( int a, int bits )
{    return (a << bits) | (a >>> (32 - bits));
}

//
// examples of some colliding hashes
// https://bugzilla.mozilla.org/show_bug.cgi?id=290032#c5
//

//$ ./hashit "ABA/xxx.aba"
//8ffac222
//$ ./hashit "XyZ/xxx.xYz"
//8ffac222
//$ ./hashit "CSS/xxx.css"
//8ffac222
//$ ./hashit "JPG/xxx.jpg"
//8ffac222

//$ ./hashit modules_newsfeeds/MenuBar/MenuBar.css
//15c23729
//$ ./hashit modules_newsfeeds/ListBar/ListBar.css
//15c23729

//$ ./hashit modules_newsfeeds/MenuBar/MenuBar.js
//a15c23e5
//$ ./hashit modules_newsfeeds/ListBar/ListBar.js
//a15c23e5



//
// our attempt at porting this algorithm to java...
//

void setup( )
{

String a = "ABA/xxx.aba";
String b = "CSS/xxx.css";
String c = "CSS/xxx.css";
String d = "JPG/xxx.jpg";

println( getHash(a) ); //yes 8ffac222
println( getHash(b) ); //yes 8ffac222
println( getHash(c) ); //yes 8ffac222
println( getHash(d) ); //no [??] FFFFFF98, not 8ffac222

println( "-----" );

String e = "modules_newsfeeds/MenuBar/MenuBar.css";
String f = "modules_newsfeeds/ListBar/ListBar.css";

println( getHash(e) ); //no [??] FFFFFF8C, not 15c23729
println( getHash(f) ); //no [??] FFFFFF8C, not 15c23729

println( "-----" );

String g = "modules_newsfeeds/MenuBar/MenuBar.js";
String h = "modules_newsfeeds/ListBar/ListBar.js";

println( getHash(g) ); //yes [??] FFFFFF8C, not a15c23e5
println( getHash(h) ); //yes [??] FFFFFF8C, not a15c23e5

}
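
With the rotation fixed, here is a rough sketch of the kind of pre-processing checker I have in mind, in plain Java (class and method names are just placeholders):

import java.util.Arrays;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

public class CacheKeyChecker
{
    //the same rotate-and-xor key function as above
    static int ffHash( String key )
    {
        int h = 0;
        for ( int i = 0; i < key.length(); i++ )
            h = ((h << 4) | (h >>> 28)) ^ key.charAt(i);
        return (h == 0) ? 0xFFFFFFFF : h;
    }

    //print every pair of distinct URLs that share a cache hash key
    static void reportCollisions( List<String> urls )
    {
        Map<Integer,String> seen = new HashMap<Integer,String>();
        for ( String url : urls )
        {
            String prev = seen.put( ffHash(url), url );
            if ( prev != null && !prev.equals(url) )
                System.out.println( "collision: " + prev + " <-> " + url );
        }
    }

    public static void main( String[] args )
    {
        reportCollisions( Arrays.asList(
            "modules_newsfeeds/MenuBar/MenuBar.css",
            "modules_newsfeeds/ListBar/ListBar.css" ) );
    }
}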
+5  A: 

From what I understand from just reading the Bugzilla entry, the bug manifests when two distinct problems occur:

  1. Their hash algorithm generates collisions for URLs that are "similar enough". From the bug, "similar enough" seems to mean the URLs are the same every 4 characters (or perhaps 8), and
  2. Their logic for dealing with hash collisions fails because they haven't yet flushed the previous URL with the same hash value to disk.

So basically, if you have a page with two very similar URLs, this might happen on some versions of Firefox. I would expect it generally won't happen across different pages, since FF will have had time to flush the entries to disk, avoiding the timing issue.

So if you have multiple resources (scripts, images, etc.) that are all loaded from the same page, make sure they have a run of 9 characters that are completely different. One way you might ensure this is by appending a query string (that you ignore) with a random bit of data, something like the helper below.
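
A sketch of the idea (the helper and parameter names here are made up; the server simply ignores the extra parameter):

//hypothetical helper: append a random hex token so that
//otherwise-similar URLs differ in a long run of characters
String cacheBust( String url )
{
    String sep = url.contains("?") ? "&" : "?";
    return url + sep + "nocache=" + Integer.toHexString( (int)(Math.random() * Integer.MAX_VALUE) );
}

//e.g. cacheBust("images/card.png") might return "images/card.png?nocache=3a91c47f"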

jeffamaphone
Yeah, I read bytes where it should have been bits and mentally converted that to characters. Others below have good explanations of the hashing algorithm.
jeffamaphone
The query-string suggestion is good, but I would like to ensure unique URLs for my files as a pre-processing step.
jedierikb
Also, adding a random query string at runtime requires caching that random query string somewhere, versus developing a naming pattern which doesn't collide.
jedierikb
+1  A: 

First, you cannot hash all strings to unique integers; obviously, there are more strings than (fixed-size) integers, so there have to be collisions. You can have a hash table that holds any set of data (e.g. all your files), but to get it, you need to change the code of the hash table, not the hashing function.

Second, I see a problem with the hashing function you posted, in this part:

PR_ROTATE_LEFT32(h, 4)

If it really does rotate h (I haven't checked), then rotating a 32-bit hash by 4 bits means eight characters make a full cycle, so strings that have two 8-character parts swapped (e.g. xxxxxxxxyyyyyyyy vs. yyyyyyyyxxxxxxxx) will have equal hashes. If you change the rotation to something relatively prime to the hash width (e.g. 5 bits), this will only happen for swapped parts of length 32.
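
A minimal standalone check of that claim (plain Java, re-declaring the hash locally):

public class SwapCollision
{
    //the Firefox rotate-and-xor hash, ported to Java
    static int ffHash( String key )
    {
        int h = 0;
        for ( int i = 0; i < key.length(); i++ )
            h = ((h << 4) | (h >>> 28)) ^ key.charAt(i);
        return (h == 0) ? 0xFFFFFFFF : h;
    }

    public static void main( String[] args )
    {
        //the two 8-character halves are swapped, yet both lines
        //print the same value
        System.out.printf( "%08x%n", ffHash("xxxxxxxxyyyyyyyy") );
        System.out.printf( "%08x%n", ffHash("yyyyyyyyxxxxxxxx") );
    }
}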

jpalecek
I think the question he is asking is "How can I work around this poor hash function?", not "How can I build a better hash function?"
FryGuy
+3  A: 

Here is how the algorithm works:

initialize hash to 0
for each byte
    shift hash 4 bits to left (with rotate)
    hash = hash XOR character

visually (16-bit version):

00110000             = '0'
    00110001         = '1'
        00110010     = '2'
            00110011 = '3'
0100            0011 = '4'
00110101             = '5'
====================
01000110001000010000  (and then this will be 'rotated'
                       so that it lines up with the end)
giving:
        00100001000001000110

What this means is that for two strings of the same length that are mostly the same to hash differently, at least one position must produce a different XOR of one character's lower 4 bits with the next character's upper 4 bits. And since a 4-bit rotation cycles a 32-bit hash every 8 characters, identical differences repeated exactly 8 characters apart cancel out, which is why "MenuBar/MenuBar.css" and "ListBar/ListBar.css" collide. The method of sticking the 32-bit number into a table might be even weaker, meaning that it requires the lower-4-XOR-upper-4 at a particular location in the string (mod 8 chars) to be unique.

FryGuy
A: 

You're apparently mistaken about the real bug. Sure, there are hash collisions due to the incredibly bad choice of a hash algorithm. But even hash(x) = 1 wouldn't cause the problems described; it would merely turn an O(1) lookup into an O(N) linked-list search through the first bucket.

The real problem is that Firefox fails to deal with hash collisions. It therefore requires a perfect hash of all URLs, and "all URLs" is unfortunately a set outside of your control.
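
A toy sketch (not Firefox's actual code) of what collision-tolerant lookup means: store the full key next to the cached value and verify it on every hit, so a collision degrades into a cache miss instead of returning the wrong entry:

import java.util.HashMap;
import java.util.Map;

public class VerifyingCache
{
    static class Entry
    {
        final String key; final byte[] value;
        Entry( String k, byte[] v ) { key = k; value = v; }
    }

    private final Map<Integer, Entry> table = new HashMap<Integer, Entry>();

    void put( String url, byte[] data )
    {   table.put( hash(url), new Entry(url, data) );
    }

    byte[] get( String url )
    {
        Entry e = table.get( hash(url) );
        //verify the full key on a hash hit: a colliding URL is a miss
        return (e != null && e.key.equals(url)) ? e.value : null;
    }

    private static int hash( String key )
    {
        int h = 0;
        for ( int i = 0; i < key.length(); i++ )
            h = ((h << 4) | (h >>> 28)) ^ key.charAt(i);
        return (h == 0) ? 0xFFFFFFFF : h;
    }
}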

MSalters
With a pre-processing utility for my site, I can at least ensure that my site's subset of "all URLs" doesn't collide.
jedierikb
+2  A: 

This bug was a major issue for my site: http://worldofsolitaire.com

I worked around it a long time ago by using a conditional rule in an .htaccess file that disabled ALL caching of images on the site for Firefox users. This was a horrible thing to have to do, but at the time I couldn't track down the bug within Firefox, and having the site be slightly slower is better than showing duplicate/corrupted images.

When I read in the linked bug that it was fixed in the latest Firefox releases, I changed the conditional on April 19th 2009 (yesterday) to only disable caching for Firefox 2 users.

A few hours later, I had received over 10 e-mails from (confirmed) Firefox 3 users saying that they were seeing duplicate images. So this issue is STILL a problem in Firefox 3.

I decided to create a simple Linux test program that would let me check URLs to see if they generate the same cache hash key.

To compile on any Linux system: g++ -o ffgenhash ffgenhash.cpp

Here is the code (save it to a file named ffgenhash.cpp):

#include <stdio.h>
#include <string.h>
#include <stdlib.h>

#define ULONG_MAX 0xFFFFFFFF
#define PR_ROTATE_LEFT32(a, bits) (((a) << (bits)) | ((a) >> (32 - (bits))))

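// Reimplementation of nsDiskCache::Hash() from Mozilla's nsDiskCacheDevice.cpp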
unsigned long ffgenhash(const char * key)
{
    unsigned long h=0;

    for(const unsigned char * s = (unsigned char *) key; *s != '\0'; ++s)
    {
        h = PR_ROTATE_LEFT32(h, 4) ^ *s;
    }

    return (h==0 ? ULONG_MAX : h);
}

int main(int argc, char ** argv)
{
    printf("%d\n", ffgenhash(argv[1]));
    return 0;
}

As you can see, here are two real-life URLs that generate the same cache hash key:

./ffgenhash "http://worldofsolitaire.com/decks/paris/5/12c.png"
1087949033
./ffgenhash "http://worldofsolitaire.com/decks/paris/5/13s.png"
1087949033

Since I pre-load these images in a JavaScript loop, trying to use some sort of empty <script> tag workaround is not possible here.

Indeed, I think my only real solution is to modify the URLs for Firefox users in some way to generate a unique cache hash key. So that's the approach I'll use.

By the way, I'm half tempted to create a Firebug add-on that checks all resources loaded by a site and shows a big error if two resources on the site share a common hash key, so that the developer is aware. It would be great to run sites like Google Maps through this, as I've seen weird things with those images over the past few years :)

Sembiance
+1  A: 

This is a modified version of Sembiance's hash generator, which works correctly even when compiled on a 64-bit platform:

#include <stdio.h>
#include <string.h>
#include <stdlib.h>

#define ULONG_MAX 0xFFFFFFFF
#define PR_ROTATE_LEFT32(a, bits) (((a) << (bits)) | ((a) >> (32 - (bits))))

unsigned int ffgenhash(const char * key) {
    unsigned int h=0;
    for(const unsigned char * s = (unsigned char *) key; *s != '\0'; ++s) {
        h = PR_ROTATE_LEFT32(h, 4) ^ *s;
    }
    return (h==0 ? ULONG_MAX : h);
}

int main(int argc, char ** argv) {
    printf("%u\n", ffgenhash(argv[1]));
    return 0;
}
Darwin