Given these two images from Twitter:

http://a3.twimg.com/profile_images/130500759/lowres_profilepic.jpg
http://a1.twimg.com/profile_images/58079916/lowres_profilepic.jpg

I want to download them to the local filesystem and store them in a single directory. How should I avoid name conflicts?

In the example above, I cannot store both as *lowres_profilepic.jpg*. My design idea is to treat the URLs as opaque strings except for the last segment. What algorithms (implemented as f) can I use to hash the prefixes into unique strings?

f( "http://a3.twimg.com/profile_images/130500759/" ) = 6tgjsdjfjdhgf
f( "http://a1.twimg.com/profile_images/58079916/" )  = iuhd87ysdfhdk

That way, I can save the files as:

6tgjsdjfjdhgf_lowres_profilepic.jpg
iuhd87ysdfhdk_lowres_profilepic.jpg

I don't want a cryptographic algorithm, as this needs to be a performant operation.

+1  A: 

It sounds like your question is really which hash algorithm is best for this. You might want to check this: http://stackoverflow.com/questions/251346/best-hashing-algorithm-in-terms-of-hash-collisions-and-performance

Svetlozar Angelov
+4  A: 

The nature of a hash is that it may result in collisions. How about one of these alternatives:

  1. Use a directory tree: literally create sub-directories for each component of the URL (see the sketch after this list).
  2. Generate a unique id. The problem here is keeping the mapping between the real name and the saved id. You could use a database that generates unique ids and maps each URL to its id, then simply use that id as the filename.
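
A minimal sketch of the first alternative, mirroring the URL's components as directories. The baseDir argument and the use of Uri/Path here are illustrative, not a prescribed implementation:

using System;
using System.IO;

class UrlToDirectoryTree
{
    // Turn ".../profile_images/130500759/lowres_profilepic.jpg" into a nested
    // directory path under baseDir, keeping the original file name at the leaf.
    static string SaveLocation(string baseDir, string url)
    {
        Uri uri = new Uri(url);
        string dir = Path.Combine(baseDir, uri.Host);
        // every path segment except the last one becomes a sub-directory
        for (int i = 1; i < uri.Segments.Length - 1; i++)
            dir = Path.Combine(dir, uri.Segments[i].Trim('/'));
        Directory.CreateDirectory(dir);
        return Path.Combine(dir, Path.GetFileName(uri.LocalPath));
    }

    static void Main()
    {
        Console.WriteLine(SaveLocation("cache",
            "http://a3.twimg.com/profile_images/130500759/lowres_profilepic.jpg"));
        // cache/a3.twimg.com/profile_images/130500759/lowres_profilepic.jpg
        // (directory separators depend on the platform)
    }
}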
djna
I did think about using the database for this.
Jacques René Mesrine
Didn't you mention you wanted a performant solution?
drhirsch
All performance is relative - slipping one extra record into a local database probably compares pretty well with downloading an image. Sure, it's not the fastest possible approach, but I would favour the simplest thing that could work until it's proven too slow.
djna
+4  A: 

One of the key concepts of a URL is that it is unique. Why not use it?

Every algorithm that shortens the information can produce collisions. Maybe unlikely, but possible nevertheless.
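
A rough sketch of using the URL itself as the file name, simply replacing characters that are not legal in file names; the replacement character and the exact invalid set are platform details, not something this answer prescribes:

using System;
using System.IO;

class UrlAsFileName
{
    static string Sanitize(string url)
    {
        // Path.GetInvalidFileNameChars() is platform-dependent; on Windows it
        // includes ':' and '/', which is enough to flatten a URL into one name.
        char[] invalid = Path.GetInvalidFileNameChars();
        char[] chars = url.ToCharArray();
        for (int i = 0; i < chars.Length; i++)
            if (Array.IndexOf(invalid, chars[i]) >= 0)
                chars[i] = '_';
        return new string(chars);
    }

    static void Main()
    {
        Console.WriteLine(Sanitize("http://a1.twimg.com/profile_images/58079916/lowres_profilepic.jpg"));
        // on Windows: http___a1.twimg.com_profile_images_58079916_lowres_profilepic.jpg
    }
}

As the comments below note, names built this way can run into path-length limits.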

Peter
It looks like he's using something related to Twitter.
furtelwart
This is the simplest approach, but he would need to watch out for the 255-character path limit on some OSes (e.g. XP). Note that the actual URL may be less than 255 characters, but combined with the parent folder(s) it may be longer, and this is painful. Some URLs can be ridiculously long!
Ash
A: 

The git version control system is based on SHA-1 because it has a very minimal chance of collision.

If it's good for git, it will be good for you too.

Vereb
No cryptographic algos, see the question.
furtelwart
This is 2009; I can't imagine it's slow for short URLs.
Vereb
I know, but if the asker doesn't want cryptographic algos, your answer doesn't help.
furtelwart
This app runs on mobile phones.
Jacques René Mesrine
I agree that the problem description said not to use one. Now I see it will run on a mobile phone, so I guess you don't need to hash thousands of URLs per second. A mobile can play videos, which needs more computation. I'm just saying it's worth measuring the performance of SHA-1 with an existing library before resorting to a more complicated method.
Vereb
Unfortunately floating-point calculation is still expensive in managed environments on phones (it is normally emulated using integer arithmetic). Video playback is normally handled by native code. There is a difference.
Jacques René Mesrine
Cryptographic hashes don't use floating point math.
Nick Johnson
+2  A: 

A very simple approach:

f( "http://a3.twimg.com/profile_images/130500759/" ) = a3_130500759.jpg
f( "http://a1.twimg.com/profile_images/58079916/" )  = a1_58079916.jpg

As the other parts of the URL are constant, you can use the subdomain plus the last part of the path as a unique filename.

I don't see what could be a problem with this solution.
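
One possible sketch of this idea, keeping the original file name appended rather than just the extension; the parsing here is illustrative only:

using System;

class SubdomainIdName
{
    static string MakeName(string url)
    {
        Uri uri = new Uri(url);
        string subdomain = uri.Host.Split('.')[0];            // "a3"
        string[] segments = uri.AbsolutePath.Trim('/').Split('/');
        string id = segments[segments.Length - 2];             // "130500759"
        string file = segments[segments.Length - 1];           // "lowres_profilepic.jpg"
        return subdomain + "_" + id + "_" + file;
    }

    static void Main()
    {
        Console.WriteLine(MakeName(
            "http://a3.twimg.com/profile_images/130500759/lowres_profilepic.jpg"));
        // a3_130500759_lowres_profilepic.jpg
    }
}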

furtelwart
What if Twitter changes their image hosting servers? Just a year ago, profile pics were stored on S3.
Jacques René Mesrine
Hm, this could be a problem, indeed.
furtelwart
A: 

You said:

I don't want a cryptographic algorithm as it this needs to be a performant operation.

Well, I understand your need for speed, but I think you need to consider the drawbacks of your approach. If you just need to create hashes for URLs, you should stick with an existing algorithm rather than writing a new one, where you'll have to deal with collisions yourself, for instance.

So you could have a Dictionary&lt;string, string&gt; to work as a cache for your URLs: when you get a new address, you first do a lookup in that dictionary and, if you don't find a match, hash it and store it for future use.
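
A minimal sketch of that cache, assuming it is handed a hash function such as the HashIt method shown below; the class and names here are just for illustration:

using System;
using System.Collections.Generic;

class UrlHashCache
{
    private readonly Dictionary<string, string> cache = new Dictionary<string, string>();
    private readonly Func<string, string> hash;

    public UrlHashCache(Func<string, string> hash)
    {
        this.hash = hash;
    }

    public string Get(string url)
    {
        string value;
        if (!cache.TryGetValue(url, out value))
        {
            value = hash(url);   // only hash URLs we have not seen before
            cache[url] = value;
        }
        return value;
    }
}

It could then be handed the HashIt method shown below (e.g. new UrlHashCache(HashIt)) and consulted before downloading each image.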

Following this line, you could give MD5 a try:

using System;
using System.Security.Cryptography;
using System.Text;

class UrlHasher
{
    public static void Main(string[] args)
    {
        foreach (string url in new string[]{ 
            "http://a3.twimg.com/profile_images/130500759/lowres_profilepic.jpg", 
            "http://a1.twimg.com/profile_images/58079916/lowres_profilepic.jpg" })
        {
            Console.WriteLine(HashIt(url));
        }
    }

    private static string HashIt(string url)
    {
        // resolving "." against the URL drops the last segment, leaving only
        // the "http://host/.../130500759/" prefix to be hashed
        Uri path = new Uri(new Uri(url), ".");
        MD5CryptoServiceProvider md5 = new MD5CryptoServiceProvider();
        byte[] data = md5.ComputeHash(
            Encoding.ASCII.GetBytes(path.OriginalString));
        return Convert.ToBase64String(data);
    }
}

You'll get:

rEoztCAXVyy0AP/6H7w3TQ==
0idVyXLs6sCP/XLBXwtCXA==
Rubens Farias
+3  A: 

It seems what you really want is to have a legal filename that won't collide with others.

  • Any encoding of the URL will work, even base64: e.g. filename = base64(url) (see the sketch after this list)
  • A crypto hash will give you what you want - although you claim this will be a performance bottleneck, don't be sure until you've benchmarked it
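
A hedged sketch of the base64 option. Standard base64 can emit '/' and '+', which are awkward in file names, so this swaps them for a filesystem-safe pair; that substitution is my addition, not part of the answer:

using System;
using System.Text;

class Base64FileName
{
    static string Encode(string url)
    {
        string b64 = Convert.ToBase64String(Encoding.UTF8.GetBytes(url));
        return b64.Replace('/', '_').Replace('+', '-');   // keep the name legal on disk
    }

    static string Decode(string name)
    {
        string b64 = name.Replace('_', '/').Replace('-', '+');
        return Encoding.UTF8.GetString(Convert.FromBase64String(b64));
    }

    static void Main()
    {
        string name = Encode("http://a3.twimg.com/profile_images/130500759/lowres_profilepic.jpg");
        Console.WriteLine(name);
        Console.WriteLine(Decode(name));   // round-trips back to the original URL
    }
}

The nice property is that the mapping is reversible, so the original URL can always be recovered from the file name.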
orip
Yep, forget hashing, just base64 encode it and be done.
Keith Randall
+2  A: 

While CRC32 produces at most 2^32 values regardless of your input, and so cannot avoid collisions entirely, it is still a viable option for this scenario.

It is fast, so if you generate a filename that conflicts, just add or change a character in your URL and re-calculate the CRC.

With 4.3 billion possible checksums, the likelihood of a filename conflict, when combined with the original filename, is going to be so low as to be unimportant in normal situations.

I've used this approach myself for something similar and was pleased with the performance. See Fast CRC32 in Software.
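
As far as I know, the .NET base class library has no built-in CRC32, so here is a small table-driven sketch of the standard CRC-32 (0xEDB88320 polynomial) applied to the URL prefix; the formatting of the name is just one possibility:

using System;
using System.Text;

class Crc32Name
{
    static readonly uint[] Table = BuildTable();

    static uint[] BuildTable()
    {
        uint[] table = new uint[256];
        for (uint i = 0; i < 256; i++)
        {
            uint c = i;
            for (int k = 0; k < 8; k++)
                c = (c & 1) != 0 ? 0xEDB88320 ^ (c >> 1) : c >> 1;
            table[i] = c;
        }
        return table;
    }

    static uint Crc32(byte[] data)
    {
        uint crc = 0xFFFFFFFF;
        foreach (byte b in data)
            crc = Table[(crc ^ b) & 0xFF] ^ (crc >> 8);
        return crc ^ 0xFFFFFFFF;
    }

    static void Main()
    {
        string prefix = "http://a3.twimg.com/profile_images/130500759/";
        uint crc = Crc32(Encoding.ASCII.GetBytes(prefix));
        // e.g. "9f3b2c1a_lowres_profilepic.jpg" (the value here is illustrative, not computed)
        Console.WriteLine(crc.ToString("x8") + "_lowres_profilepic.jpg");
    }
}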

Ash
+5  A: 

Irrespective of the how you do it (hashing, encoding, database lookup) I recommend that you don't try to map a huge number of URLs to files in a big flat directory.

The reason is that file lookup for most file systems involves a linear scan through the filenames in a directory. So if all N of your files are in one directory, a lookup will involve N/2 comparisons on average; i.e. it is O(N). (Note that ReiserFS organizes the names in a directory as a B-tree. However, ReiserFS seems to be the exception rather than the rule.)

Instead of one big flat directory, it would be better to map the URIs to a tree of directories. Depending on the shape of the tree, lookup can be as good as O(log N). For example, if you organized the tree so that it had 3 levels of directory with at most 100 entries in each directory, you could accommodate 1 million URLs. If you designed the mapping to use 2-character directory names, each directory should easily fit into a single disk block, and a pathname lookup (assuming that the required directories are already cached) should take a few microseconds.
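
An illustrative sketch of one such mapping, slicing a hash of the URL into 2-character directory names; MD5 and three levels are arbitrary choices here, not part of the answer:

using System;
using System.IO;
using System.Security.Cryptography;
using System.Text;

class ShardedPath
{
    static string PathFor(string baseDir, string url)
    {
        byte[] digest;
        using (MD5 md5 = MD5.Create())
            digest = md5.ComputeHash(Encoding.UTF8.GetBytes(url));
        string hex = BitConverter.ToString(digest).Replace("-", "").ToLowerInvariant();

        // three levels of 2-character directories, then the full hash as the file name
        string dir = Path.Combine(baseDir, hex.Substring(0, 2));
        dir = Path.Combine(dir, hex.Substring(2, 2));
        dir = Path.Combine(dir, hex.Substring(4, 2));
        return Path.Combine(dir, hex + ".jpg");
    }

    static void Main()
    {
        Console.WriteLine(PathFor("cache",
            "http://a3.twimg.com/profile_images/130500759/lowres_profilepic.jpg"));
        // e.g. cache/ab/cd/ef/abcdef....jpg (hash value illustrative)
    }
}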

Stephen C
Nowadays filesystems usually use trees for their file structure.
Gumbo
A: 

It appears that the numerical part of twimg.com URLs is already a unique value for each image. My research indicates that the number is sequential (i.e. the example URL below is for the 433,484,366th profile image ever uploaded - which just happens to be mine). Thus, this number is unique. My solution would be to simply use the numerical part of the URL as the "hash value", with no fear of ever finding a non-unique value.

  • URL: http://a2.twimg.com/profile_images/433484366/terrorbite-industries-256.png
  • Filename: 433484366.terrorbite-industries-256.png
  • Unique ID: 433484366

I already use this system for a Python script that displays notifications for new tweets, and as part of its operation it caches profile image thumbnails to reduce unnecessary downloads.

P.S. It makes no difference which subdomain the image is downloaded from; all images are available from all subdomains.
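
A small sketch of that extraction; the regex and names are mine, and it assumes the ".../profile_images/<id>/<file>" shape holds:

using System;
using System.Text.RegularExpressions;

class NumericIdName
{
    static string MakeName(string url)
    {
        // ".../profile_images/<id>/<file>" -> "<id>.<file>"
        Match m = Regex.Match(url, @"/profile_images/(\d+)/([^/]+)$");
        if (!m.Success)
            throw new ArgumentException("unexpected URL shape: " + url);
        return m.Groups[1].Value + "." + m.Groups[2].Value;
    }

    static void Main()
    {
        Console.WriteLine(MakeName(
            "http://a2.twimg.com/profile_images/433484366/terrorbite-industries-256.png"));
        // 433484366.terrorbite-industries-256.png
    }
}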

TerrorBite
A: 

I'm playing with thumbalizr using a modified version of their caching script, and I think it has a few good solutions. The code is on github.com/mptre/thumbalizr, but the short version is that it uses md5 to build the file names, and it takes the first two characters of the filename to create a folder with exactly that name. This means it is easy to break the folders up, and fast to find the corresponding folder without a database. Kind of blew my mind with its simplicity.

It generates file names like this: http://pappmaskin.no/opensource/delicious_snapcasa/mptre-thumbalizr/cache/fc/fcc3a328e0f4c1b51bf5e13747614e7a_1280_1024_8_90_250.png

The last part, _1280_1024_8_90_250, matches the different settings that the script uses when talking to the thumbalizr API, but I guess fcc3a328e0f4c1b51bf5e13747614e7a is a straight md5 of the URL, in this case thumbalizr.com.

I tried changing the config to generate images 200px wide, and that image goes in the same folder, but instead of _250.png it is called _200.png.

I haven't had time to dig that deeply into the code, but I'm sure it could be pulled apart from the thumbalizr logic and made more generic.
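
A rough sketch of that layout as I understand it (md5 of the URL as the file name, with the first two characters reused as the folder name); the suffix handling is simplified and the names are illustrative:

using System;
using System.IO;
using System.Security.Cryptography;
using System.Text;

class ThumbalizrStylePath
{
    static string CachePath(string cacheDir, string url, string suffix)
    {
        byte[] digest;
        using (MD5 md5 = MD5.Create())
            digest = md5.ComputeHash(Encoding.UTF8.GetBytes(url));
        string hex = BitConverter.ToString(digest).Replace("-", "").ToLowerInvariant();
        // folder named after the first two characters of the md5, file named after the full md5
        string dir = Path.Combine(cacheDir, hex.Substring(0, 2));
        return Path.Combine(dir, hex + suffix + ".png");
    }

    static void Main()
    {
        // suffix mirrors the "_1280_1024_8_90_250" settings string described above
        Console.WriteLine(CachePath("cache", "http://thumbalizr.com", "_1280_1024_8_90_250"));
    }
}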

Morten Skogly