I've got some data coming in through a byte stream, and I want to determine its file type so I know how to parse it. At present I'm only concerned with HTML and images; everything else can be discarded.

What's an efficient method of differentiating between the two? And what if I want to expand this to include other file types?

+1  A: 

There is a wrapper for libmagic out there, but I don't know whether it's still maintained or working.

Ignacio Vazquez-Abrams
+1  A: 

See this question for a solution.

Onkelborg
+1  A: 

This Stack Overflow question discusses the same problem and is tagged with Python (though the problem itself has nothing to do with any particular programming language). The answers there mention this article on file type signatures (not really signatures, but the magic numbers that known file types commonly start with). For security reasons, if the detected type is going to control your application logic in a non-trivial way, I would recommend accepting the stream from a trusted source only.
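
To make the magic-number idea concrete, here is a minimal sketch in Python. The table `IMAGE_SIGNATURES` and the function `sniff_image` are names made up for this illustration, and the three formats listed are only a sample; a real table would cover whatever types you care about.

```python
# Leading bytes ("magic numbers") of a few common image formats.
# The table name and the choice of formats are assumptions for illustration.
IMAGE_SIGNATURES = {
    b"\x89PNG\r\n\x1a\n": "png",
    b"\xff\xd8\xff": "jpeg",
    b"GIF87a": "gif",
    b"GIF89a": "gif",
}


def sniff_image(head: bytes):
    """Return an image type name if `head` starts with a known signature, else None."""
    for signature, name in IMAGE_SIGNATURES.items():
        if head.startswith(signature):
            return name
    return None
```

Only the first few bytes of the stream are needed, so you can peek at the head and decide before reading the rest. Supporting another file type then amounts to adding a row to the table.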

Also, since you're only distinguishing HTML from binary data (at the moment), you might want to check for a zero byte in the stream (the byte value 0x00, not the character '0'), or for any other byte that is illegal in HTML (e.g. 0x01).
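
A rough sketch of that check in Python follows. The function name `looks_binary` and the exact set of allowed control bytes are assumptions of this example, not a standard. Note that it will misclassify HTML encoded as UTF-16, which legitimately contains zero bytes, so treat it as a heuristic for single-byte and UTF-8 input.

```python
# Control bytes that routinely appear in text: tab, line feed, form feed, carriage return.
# Which bytes to allow is an assumption of this sketch.
ALLOWED_CONTROL_BYTES = {0x09, 0x0A, 0x0C, 0x0D}


def looks_binary(chunk: bytes) -> bool:
    """Heuristic: True if `chunk` contains 0x00 or another control byte unusual in HTML."""
    return any(b < 0x20 and b not in ALLOWED_CONTROL_BYTES for b in chunk)
```

Running it on the first chunk of the stream is usually enough, since image formats tend to have such bytes in their headers.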

steinar