views: 590
answers: 10
Hi all

I have a problem in my web crawler: I am trying to retrieve images from a particular website, and I often see images that are exactly the same but have different URLs (i.e. addresses).

Is there any Java library or utility that can identify whether two images are exactly the same in their content (i.e. at the pixel level)?

My input will be the URLs of the images, from which I can download them.

+1  A: 

I would think you don't need an image library to do this - simply fetching the URL content and comparing the two streams as byte arrays should do it.

Unless of course you are interested in identifying similar images as well.
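
A minimal sketch of that byte-for-byte check (my own, not from the answer; it assumes Java 9+ for InputStream.readAllBytes(), and the method names are illustrative):

  import java.io.InputStream;
  import java.net.URL;
  import java.util.Arrays;

  // Fetch both URLs and compare the raw bytes.
  static boolean sameContent(String urlA, String urlB) throws Exception {
      return Arrays.equals(download(urlA), download(urlB));
  }

  static byte[] download(String url) throws Exception {
      try (InputStream in = new URL(url).openStream()) {
          return in.readAllBytes();
      }
  }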

levik
What if the image is compressed with a lossy algorithm? You could have two images that are the same but with different bytes.
Alex Reynolds
You could - but you're not likely to. Such images wouldn't be identical, but would be similar to each other. Pixel-perfect clones mostly don't exist on the web - a copy will either be byte-for-byte identical, or will have some pixels that differ from the original (perhaps a badge in the corner).
levik
+1  A: 

Why not write one yourself? It does not seem difficult. Check this out.

Shoban
+4  A: 

Depending on how detailed you want to get with it:

  • download the image
  • as you download it generate a hash for it
  • make a directory whose name is the hash value (if it does not already exist)
  • if the directory contains 2 or more files, compare the file sizes
  • if the file sizes are the same, do a byte-by-byte comparison of the image against the images already in that directory
  • if the bytes are unique, then you have a new image

Regardless of whether you want to do all that, you need to:

  • download the images
  • do a byte-by-byte comparison of the images

No need to rely on any special imaging libraries, images are just bytes.
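
An illustrative sketch of the bucket-then-compare idea (using an in-memory map instead of the hash-named directories described above; the names are made up for the example):

  import java.math.BigInteger;
  import java.security.MessageDigest;
  import java.util.*;

  // hash (as hex) -> image bytes seen so far with that hash
  Map<String, List<byte[]>> buckets = new HashMap<>();

  boolean isDuplicate(byte[] imageBytes) throws Exception {
      String key = new BigInteger(1,
              MessageDigest.getInstance("MD5").digest(imageBytes)).toString(16);
      List<byte[]> bucket = buckets.computeIfAbsent(key, k -> new ArrayList<>());
      for (byte[] existing : bucket) {
          // Arrays.equals checks length first, then byte by byte
          if (Arrays.equals(existing, imageBytes)) {
              return true;       // seen this exact image before
          }
      }
      bucket.add(imageBytes);    // new image
      return false;
  }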

TofuBeer
You really only need to do this if you use a weak hash function. Honestly, you can just use a fairly strong hash function and "trust the hash".
Neil Coffey
+3  A: 

Look at the MessageDigest class. Essentially, you create an instance of it, then pass it a series of bytes. The bytes could be the bytes directly loaded from the URL if you know that two images that are the "same" will be the selfsame file/stream of bytes. Or if necessary, you could create a BufferedImage from the stream, then pull out pixel values, something like:

  // bimg is the BufferedImage decoded from the stream (e.g. via ImageIO.read);
  // getInstance() throws the checked NoSuchAlgorithmException.
  MessageDigest md = MessageDigest.getInstance("MD5");
  // buffer holding one row of ARGB values at a time
  ByteBuffer bb = ByteBuffer.allocate(4 * bimg.getWidth());
  for (int y = bimg.getHeight()-1; y >= 0; y--) {
    bb.clear();
    for (int x = bimg.getWidth()-1; x >= 0; x--) {
      bb.putInt(bimg.getRGB(x, y));
    }
    md.update(bb.array());
  }
  byte[] digBytes = md.digest();

Either way, MessageDigest.digest() eventually gives you a byte array which is the "signature" of the image. You could convert this to a hex string if it's helpful, e.g. for putting in a HashMap or database table:

  StringBuilder sb = new StringBuilder();
  for (byte b : digBytes) {
    sb.append(String.format("%02X", b & 0xff));
  }
  String signature = sb.toString();

If the content/image from two URLs gives you the same signature, then they're the same image.

Edit: I forgot to mention that if you were hashing pixel values, you'd probably want to include the dimensions of the image in the hash too. (Just do a similar thing: write two ints to an 8-byte ByteBuffer, then update the MessageDigest with the corresponding 8-byte array.)
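
Something along these lines, continuing the sketch above (bimg and md as before):

  // Fold the dimensions into the digest so that, say, a 100x200 and a
  // 200x100 image with the same pixel stream don't hash alike.
  ByteBuffer dims = ByteBuffer.allocate(8);
  dims.putInt(bimg.getWidth());
  dims.putInt(bimg.getHeight());
  md.update(dims.array());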

The other thing somebody mentioned is that MD5 is not collision-resistant. In other words, there is a technique for constructing multiple byte sequences with the same MD5 hash without having to use the "brute force" method of trial and error (where, on average, you'd expect to have to try about 2^64 or 16 billion billion files before hitting a collision). That makes MD5 unsuitable where you're trying to protect against that threat model. If you're not concerned about somebody deliberately trying to fool your duplicate identification, and you're just worried about the chance of a duplicate hash "by chance", then MD5 is absolutely fine. Actually, it's not only fine, it's a bit over the top -- as I say, on average you'd expect one "false duplicate" after about 16 billion billion files. Put another way, you could have, say, a billion files and the chance of a collision would be extremely close to zero.

If you are worried about the threat model outlined above (i.e. you think somebody could be deliberately dedicating processor time to constructing files that fool your system), then the solution is to use a stronger hash. Java supports SHA-1 out of the box (just replace "MD5" with "SHA-1"). This gives you longer hashes (160 bits instead of 128 bits), and with current knowledge finding a collision is infeasible.

Personally for this purpose, I would even consider just using a decent 64-bit hash function. That'll still allow tens of millions of images to be compared with close-to-zero chance of a false positive.
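
The JDK doesn't ship a 64-bit hash as such; one simple way to get a 64-bit key (a sketch, not what the answer prescribes) is to truncate a stronger digest:

  // Sketch: take the first 8 bytes of an MD5 digest as a 64-bit key.
  // imageBytes is the raw image data, as in the other examples.
  byte[] dig = MessageDigest.getInstance("MD5").digest(imageBytes);
  long key64 = ByteBuffer.wrap(dig, 0, 8).getLong();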

Neil Coffey
That will not work: MD5 is not collision resistant (two different files can have the same MD5 hash). But it is a good start, since the odds of a collision are low (you still need to do the byte-by-byte comparison if two MD5 hashes are the same, though).
TofuBeer
See my edit-- I don't think the purpose in this case is to protect against that threat model. All we're probably looking for is a "good wide hash function", and MD5 is adequate for that.
Neil Coffey
Are you saying that it is impossible for two files to have the same hash except by attack? There is an infinite number of possible files being mapped to a finite number of hashes... (improbable and impossible are two very different things).
TofuBeer
They are different, but in this case, "improbable" means something in the order of "so improbable that there's more chance of a meteorite landing on top of you while you're writing the code".
Neil Coffey
P.S. I know this is counter-intuitive. But the "finite number" of hashes is 2^128, and that's a veeeeery big number!
Neil Coffey
+2  A: 

You could also generate an MD5 signature of the file and ignore duplicate entries. It won't help you find similar images though.

nevets1219
+1  A: 

calculate MD5s using something like this:

// imageBytes is the raw image data (byte[]) downloaded from the URL
MessageDigest m = MessageDigest.getInstance("MD5");
m.update(imageBytes, 0, imageBytes.length);
System.out.println("MD5: " + String.format("%032x", new BigInteger(1, m.digest())));

Put them in a hashmap.

krosenvold
+3  A: 

I've done something very similar to this before in Java, and I found that the PixelGrabber class in the java.awt.image package is extremely helpful (if not downright necessary).

Additionally, you would definitely want to check out the ColorConvertOp class, which performs a pixel-by-pixel color conversion of the data in the source image, with the resulting color values scaled to the precision of the destination image. The documentation goes on to say that the source and destination can even be the same image, in which case it would be quite simple to detect whether they are identical.

If you want to detect similarity, you need to use some form of averaging method, as mentioned in the answer to this question.

If you can, also check out Volume 2, Chapter 7 of Horstmann's Core Java (8th ed.), because it has a whole bunch of examples of image transformations and the like. But again, make sure to poke around the java.awt.image package, because you should find you have almost everything prepared for you :)

G'luck!
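
For reference, a rough PixelGrabber sketch (my own, not from the answer) that pulls all pixels out as ARGB ints, so two images can then be compared with Arrays.equals():

  import java.awt.Image;
  import java.awt.image.PixelGrabber;

  // Grab every pixel of an image into an int[] of ARGB values.
  static int[] grabPixels(Image img, int width, int height) throws InterruptedException {
      int[] pixels = new int[width * height];
      PixelGrabber grabber = new PixelGrabber(img, 0, 0, width, height, pixels, 0, width);
      if (!grabber.grabPixels()) {
          throw new IllegalStateException("pixel grab failed, status=" + grabber.getStatus());
      }
      return pixels;
  }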

Yoely
A: 

Hashing has already been suggested, and recognizing whether two files are identical is very easy, but you asked about the pixel level. If you want to recognize two images even when they are in different formats (.png/.jpg/.gif/...) and even when they were scaled, I suggest the following (using an image library, and assuming the images are medium/big, not 16x16 icons):

  1. scale the image to some fixed size (it depends on the samples)
  2. transform it to grey scale, for example using the RGB-to-YUV conversion and taking Y from there (very easy)
  3. compute a distance between each pair of images and set a threshold to decide whether they are the same or not

That is, sum the differences of all the grey pixels of both images to get a number; if that difference is < T, you consider both images identical.
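
A rough sketch of that idea using only the JDK (the fixed size, the grey conversion via TYPE_BYTE_GRAY, and the threshold are all assumptions for the example):

  import java.awt.Graphics2D;
  import java.awt.image.BufferedImage;

  static final int SIZE = 64;   // fixed comparison size (assumed)

  // Scale to SIZE x SIZE and convert to grey in one drawImage call.
  static BufferedImage toSmallGrey(BufferedImage src) {
      BufferedImage grey = new BufferedImage(SIZE, SIZE, BufferedImage.TYPE_BYTE_GRAY);
      Graphics2D g = grey.createGraphics();
      g.drawImage(src, 0, 0, SIZE, SIZE, null);
      g.dispose();
      return grey;
  }

  // Sum of absolute grey differences; "identical" if below some threshold T.
  static boolean sameImage(BufferedImage a, BufferedImage b, long threshold) {
      BufferedImage ga = toSmallGrey(a), gb = toSmallGrey(b);
      long diff = 0;
      for (int y = 0; y < SIZE; y++) {
          for (int x = 0; x < SIZE; x++) {
              diff += Math.abs(ga.getRaster().getSample(x, y, 0)
                             - gb.getRaster().getSample(x, y, 0));
          }
      }
      return diff < threshold;
  }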

--

A: 

Inspect the response headers and interrogate the HTTP ETag value, if present (RFC 2616: ETag). It may be the same for identical images coming from your target web server. This is because the ETag value is often a message digest such as MD5, which would allow you to take advantage of computations the web server has already done.

This may potentially allow you to not even download the image!

for each imageUrl in myList
    Perform HTTP HEAD imageUrl
    Pull ETag value from the response
    If ETag is in my map of known ETags
       move on to next image
    Else
       Download image
       Store ETag in map

Of course the ETag has to be present and if not, well the idea is toast. But maybe you have pull with the web server admins?
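
In Java, the HEAD request in that pseudocode might look roughly like this (a sketch with HttpURLConnection; error handling omitted):

  import java.net.HttpURLConnection;
  import java.net.URL;

  // Issue a HEAD request and return the ETag header, or null if absent.
  static String fetchETag(String imageUrl) throws Exception {
      HttpURLConnection conn = (HttpURLConnection) new URL(imageUrl).openConnection();
      conn.setRequestMethod("HEAD");
      try {
          conn.connect();
          return conn.getHeaderField("ETag");
      } finally {
          conn.disconnect();
      }
  }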

Stu Thompson
+1  A: 

Maybe you could check this one. http://www.java-forums.org/java-2d/22880-identify-pixel-image.html

Fion