I have a crawler that downloads pages and tries to parse the HTML. One of the issues I've been facing is how to reliably determine the MIME type of a downloaded file, and in particular whether it is really HTML.

Right now I'm using

// Wrap the downloaded text in a stream and let the JDK guess its type
InputStream is = new ByteArrayInputStream(htmlResult.getBytes("UTF-8"));
String mimeType = URLConnection.guessContentTypeFromStream(is);

but it misses sites like this one: http://www.artdaily.org/index.asp?int_sec%3D11%26int_new%3D39415 because of the extra whitespace between the doctype declaration and the <html> tag in the source.

Does anyone know a good way to determine whether a string is HTML or not? Searching for <html> or some other tag wouldn't necessarily work, because I may come across binary files that have such text embedded in them.

thanks

+1  A: 

Do you have control over the HTTP connection that your crawler uses? If so, how about checking the "Content-Type" HTTP response header? That's one way to determine the content type. I just ran a quick test against the artdaily site to see whether a Content-Type header was sent, and there is one, with the value text/html.
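As a rough illustration of the header check, here is a minimal sketch using the JDK's HttpURLConnection. The class and method names (ContentTypeCheck, mediaType, declaresHtml, declaredType) are made up for this example, and it assumes the server answers HEAD requests, which not every server does.

```java
import java.io.IOException;
import java.net.HttpURLConnection;
import java.net.URI;
import java.util.Locale;

// Hypothetical helper, not part of any library.
public class ContentTypeCheck {

    // Strips parameters such as "; charset=UTF-8" and lower-cases the result.
    // Returns null when the header is missing or empty.
    static String mediaType(String contentTypeHeader) {
        if (contentTypeHeader == null) {
            return null;
        }
        int semicolon = contentTypeHeader.indexOf(';');
        String type = semicolon >= 0
                ? contentTypeHeader.substring(0, semicolon)
                : contentTypeHeader;
        type = type.trim().toLowerCase(Locale.ROOT);
        return type.isEmpty() ? null : type;
    }

    // True when the server declares the response to be HTML or XHTML.
    static boolean declaresHtml(String contentTypeHeader) {
        String type = mediaType(contentTypeHeader);
        return "text/html".equals(type) || "application/xhtml+xml".equals(type);
    }

    // Sends a HEAD request (assumption: the server supports it) and
    // returns the declared media type, or null if none was sent.
    static String declaredType(String pageUrl) throws IOException {
        HttpURLConnection conn =
                (HttpURLConnection) URI.create(pageUrl).toURL().openConnection();
        try {
            conn.setRequestMethod("HEAD");
            return mediaType(conn.getContentType());
        } finally {
            conn.disconnect();
        }
    }
}
```

If the crawler already opens a connection to download the body, calling getContentType() on that same connection avoids the extra HEAD round trip.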

naikus
A lot of the time the content type isn't sent at all, and I've also found it sent as text/html when the file is in fact a video or a PDF. So I can't rely on the server's content type.
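When the header can't be trusted, one fallback is to sniff the first bytes of the body. The following is only a sketch under stated assumptions, not a complete detector: HtmlSniffer, looksLikeHtml and the 1024-byte limit are invented for this example, a NUL byte is treated as a sign of binary content (so UTF-16 encoded pages would be rejected), and the content must start with "<" once a UTF-8 byte order mark and leading whitespace are skipped.

```java
import java.nio.charset.StandardCharsets;
import java.util.Locale;

// Hypothetical helper, not part of any library.
public class HtmlSniffer {

    // Assumed prefix size; only this many bytes are inspected.
    private static final int SNIFF_LIMIT = 1024;

    static boolean looksLikeHtml(byte[] body) {
        if (body == null || body.length == 0) {
            return false;
        }
        int len = Math.min(body.length, SNIFF_LIMIT);
        // ISO-8859-1 maps every byte to exactly one char, so decoding
        // arbitrary (even binary) data never fails.
        String prefix = new String(body, 0, len, StandardCharsets.ISO_8859_1);

        // Heuristic: a NUL byte almost never appears in 8-bit text.
        if (prefix.indexOf('\0') >= 0) {
            return false;
        }
        // Skip a UTF-8 byte order mark (bytes EF BB BF) if present.
        if (prefix.startsWith("\u00EF\u00BB\u00BF")) {
            prefix = prefix.substring(3);
        }
        // Ignore leading whitespace and letter case.
        prefix = prefix.trim().toLowerCase(Locale.ROOT);

        // Markup should begin with '<'; this rejects files such as PDFs
        // that merely contain "<html" somewhere after their own header.
        if (!prefix.startsWith("<")) {
            return false;
        }
        return prefix.contains("<!doctype html") || prefix.contains("<html");
    }
}
```

Because it ignores whitespace and looks anywhere in the prefix, extra blank space around the doctype or the <html> tag should not matter. It is better run on the raw response bytes than on a String that has already been decoded, since decoding binary data can mangle it.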