views: 1163
answers: 5

Okay. So I have about 250,000 high-resolution images. What I want to do is go through all of them and find the ones that are corrupted. If you know what 4scrape is, then you know the nature of the images I have.

By "corrupted" I mean that when the image is loaded into Firefox, it says:

The image “such and such image” cannot be displayed, because it contains errors.

Now, I could select all of my 250,000 images (~150 GB) and drag-n-drop them into Firefox. That would be bad, though, because I don't think Mozilla designed Firefox to open 250,000 tabs. No, I need a way to programmatically check whether an image is corrupted.

Does anyone know a PHP or Python library which can do something along these lines? Or an existing piece of software for Windows?

I have already removed obviously corrupted images (such as ones that are 0 bytes) but I'm about 99.9% sure that there are more diseased images floating around in my throng of a collection.

Thanks!

+5  A: 

I suggest you check out ImageMagick for this: http://www.imagemagick.org/

It has a tool called identify, which you can use either from a script (parsing its stdout) or through the programming interface provided.

Niko
What is your (or anyone's) opinion about GraphicsMagick, which is supposed to be a more stable fork of ImageMagick?
Todd
Never played around with it - but I will give it a try - thanks for the info.
Niko
Note that identify looks at the header only, so it should be quick, but it's no guarantee against a corrupt image. I'm sure other bits of ImageMagick can provide a more thorough check, though.
therefromhere
Then again, you're scraping from 4chan; corrupt images are kind of half the point, aren't they? (I kid)
therefromhere
therefromhere is right. I tried identify out and it doesn't catch known broken images. Thanks anyways!
Joel Verhagen
+10  A: 

An easy way would be to try loading and verifying the files with PIL (Python Imaging Library).

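from PIL import Image

# `path` is the image file to check
try:
    with Image.open(path) as v_image:
        v_image.verify()
except Exception as exc:  # verify() raises on broken files
    print(path, "looks corrupted:", exc)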

From the documentation:

im.verify()

Attempts to determine if the file is broken, without actually decoding the image data. If this method finds any problems, it raises suitable exceptions. This method only works on a newly opened image; if the image has already been loaded, the result is undefined. Also, if you need to load the image after using this method, you must reopen the image file.
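
For anyone who wants to run this over a whole directory tree, here is a minimal sketch. It is only an illustration under some assumptions: the names pillow_check, find_corrupt and IMAGE_EXTENSIONS are made up for this example, the extension list is a guess at what the collection contains, and the second pass with load() is an extra step beyond verify() that fully decodes the image, so it is slower but tends to catch truncated files that verify() lets through.

```python
import os

# Assumed set of extensions; adjust to whatever the collection contains.
IMAGE_EXTENSIONS = {".jpg", ".jpeg", ".png", ".gif"}


def pillow_check(path):
    """Return None if Pillow can fully decode the file, else an error string."""
    # Imported here so find_corrupt() can also be used with another checker.
    from PIL import Image  # third-party: Pillow

    try:
        with Image.open(path) as img:
            img.verify()  # structural check, no pixel decoding
        # After verify() the image must be reopened before it can be loaded.
        with Image.open(path) as img:
            img.load()  # full decode; truncated data usually fails here
    except Exception as exc:  # Pillow raises several exception types
        return f"{type(exc).__name__}: {exc}"
    return None


def find_corrupt(root, check=pillow_check):
    """Yield (path, error) for every image under root that fails the check."""
    for dirpath, _dirnames, filenames in os.walk(root):
        for name in sorted(filenames):
            if os.path.splitext(name)[1].lower() not in IMAGE_EXTENSIONS:
                continue
            path = os.path.join(dirpath, name)
            error = check(path)
            if error is not None:
                yield path, error
```

Something like `for path, error in find_corrupt("images"): print(path, error)` would then print one line per suspect file. Whether a file that fails here also fails in Firefox is not guaranteed, since the two use different decoders.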

ChristopheD
This is working for some of the corrupted images. The advantage of this method is that it is very fast. Thanks ChristopheD!
Joel Verhagen
This solution is so simple that I've wrapped it in a Python script that recursively checks for corrupt files. I'm posting it here in the hope it helps someone else: http://bitbucket.org/denilsonsa/small_scripts/src
Denilson Sá
+3  A: 

If your exact requirement is that the image displays correctly in Firefox, you may have a difficult time: the only way to be sure would be to link against the exact same image-loading source code as Firefox.

Basic image corruption (file is incomplete) can be detected simply by trying to open the file using any number of image libraries.

However, many images fail to display simply because they use a part of the file format that the particular viewer can't handle (GIF in particular has a lot of these edge cases, but you can also find JPEGs and the rare PNG that can only be displayed in specific viewers). There are also some ugly JPEG edge cases where the file appears to be uncorrupted in viewer X, but in reality the file has been cut short and only displays correctly because very little information has been lost. Firefox can show some cut-off JPEGs correctly (you get a grey bottom), but with others it seems to load them halfway and then displays the error message instead of the partial image.
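
For the "cut short" JPEG case specifically, one rough heuristic is to look at the markers: a complete JPEG stream starts with the SOI marker (FF D8) and ends with the EOI marker (FF D9). The sketch below only reads the first and last two bytes, so it is cheap even on a large collection. The function name jpeg_looks_complete is made up for this example, and the check is far from conclusive: a file with extra bytes after EOI would be flagged even though most viewers open it, and a damaged file that still ends in FF D9 would pass.

```python
import os

JPEG_SOI = b"\xff\xd8"  # start-of-image marker
JPEG_EOI = b"\xff\xd9"  # end-of-image marker


def jpeg_looks_complete(path):
    """Heuristic: True if the file starts with SOI and ends with EOI."""
    with open(path, "rb") as f:
        head = f.read(2)
        f.seek(0, os.SEEK_END)
        if f.tell() < 4:  # too short to hold both markers
            return False
        f.seek(-2, os.SEEK_END)
        tail = f.read(2)
    return head == JPEG_SOI and tail == JPEG_EOI
```

Files that fail this test are good candidates for a closer look with a real decoder; files that pass are not proven intact.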

David
+2  A: 

In PHP, with exif_imagetype():

if (exif_imagetype($filename) === false)
{
    unlink($filename); // image is corrupted
}

EDIT: Or you can try to fully load the image with ImageCreateFromString():

if (ImageCreateFromString(file_get_contents($filename)) === false)
{
    unlink($filename); // image is corrupted
}

An image resource will be returned on success. FALSE is returned if the image type is unsupported, the data is not in a recognized format, or the image is corrupt and cannot be loaded.

Alix Axel
That only reads the first few bytes looking for an image header; that's not going to be enough to confirm the image isn't corrupt.
therefromhere
(though it's better than nothing, and it'd be quick)
therefromhere
@therefromhere: thanks, fixed it.
Alix Axel
The advantage of this method is that it checks the entire image for corruption. It's slower, but it is more thorough. Thanks eyze!
Joel Verhagen
I tried the second one, but when the image is corrupted I keep getting errors such as "libpng warning: Ignoring bad adaptive filter type", "libpng warning: Extra compressed data", and "libpng warning: Extra compression data", which appear to come from the libpng C library rather than from PHP. Has anyone else run into this?
SeanJA
A: 

You could use ImageMagick if it is available.

If you want to check a whole folder:

identify "./myfolder/*" >log.txt 2>&1

If you want to check just one file:

identify myfile.jpg
SeanJA