I have a C# component that will receive a file of one of the following types: .doc, .pdf, .xls, .rtf

These will be sent by the calling Siebel legacy app as a file stream.

So...

[LegacyApp] >> {Binary file stream} >> [Component]

The legacy app is a black box that can't be modified to tell the component what file type (doc, pdf, xls) it is sending. The component needs to read this binary stream and create a file on the filesystem with the right extension.

Any ideas?

Thanks for your time.

A: 

On Linux, there is a command called file. Given an arbitrary file, it attempts to determine what kind of file it is. For instance, it reports descriptions like these:

gzip compressed data, from Unix, last modified: Fri Jun 12 20:16:28 2009
HTML document text
vCalendar calendar file
RCS/CVS diff output text

Those are from a few random files lying around my home directory.

retracile
I'm working on a .NET component that will be deployed in a Windows environment.
A: 

Yep. See file.

And please do not reinvent the wheel. It works just fine how it is.

amphetamachine
Of course this particular wheel works under Linux. Not the common platform to target with C#.
Jens
@Jens - It's cross-platform, actually. Not the sort of platform to target with C#.
amphetamachine
Thanks Jens, I was looking at something like a file signature for each of those types I mentioned.
+4  A: 

On Linux/Unix based systems you can use the file command, but I assume you want to do this manually yourself in code...

If all you have access to is the byte stream of the file, then you would need to handle each file type independently.

Most programs/components that do what you are asking about read the first few bytes and classify the file based on them. For example, GIF files start with one of the following: GIF87a or GIF89a
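As a rough illustration of that first-few-bytes check for the file types in the question, here is a minimal C# sketch. The class and method names (SignatureSniffer, ReadHeader, GuessKind) are made up for this example. It assumes the well-known leading bytes "%PDF" for PDF, "{\rtf" for RTF, and the 8-byte OLE2 compound-file header that legacy .doc and .xls files share, so it can only report "ole2" for those two and cannot tell them apart on its own.

```csharp
using System;
using System.IO;

public static class SignatureSniffer
{
    // Leading bytes assumed for each format (hypothetical table for this sketch).
    private static readonly byte[] Pdf = { 0x25, 0x50, 0x44, 0x46 };        // "%PDF"
    private static readonly byte[] Rtf = { 0x7B, 0x5C, 0x72, 0x74, 0x66 };  // "{\rtf"
    private static readonly byte[] Ole2 =                                   // shared by .doc and .xls
        { 0xD0, 0xCF, 0x11, 0xE0, 0xA1, 0xB1, 0x1A, 0xE1 };

    // Reads up to 'count' bytes; Stream.Read may return fewer bytes per call, so loop.
    public static byte[] ReadHeader(Stream input, int count)
    {
        byte[] buffer = new byte[count];
        int total = 0;
        while (total < count)
        {
            int read = input.Read(buffer, total, count - total);
            if (read == 0) break; // end of stream
            total += read;
        }
        Array.Resize(ref buffer, total);
        return buffer;
    }

    private static bool StartsWith(byte[] data, byte[] signature)
    {
        if (data.Length < signature.Length) return false;
        for (int i = 0; i < signature.Length; i++)
        {
            if (data[i] != signature[i]) return false;
        }
        return true;
    }

    // Returns "pdf", "rtf", "ole2" (either .doc or .xls), or "unknown".
    public static string GuessKind(byte[] header)
    {
        if (StartsWith(header, Pdf)) return "pdf";
        if (StartsWith(header, Rtf)) return "rtf";
        if (StartsWith(header, Ole2)) return "ole2";
        return "unknown";
    }
}
```

A caller would read the first 8 bytes with ReadHeader, call GuessKind, and then write the full stream to disk under the chosen extension. If the incoming stream is not seekable, the header bytes already consumed have to be written out before the rest of the stream.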

Many file formats have a fixed signature at the start of the file, or a fixed header format. This signature is referred to as a magic number, as I described in this post.

A good place to get started is to go to www.wotsit.org. It contains the file format specifications searchable by file type. You could look at the important file types that you want to handle and see if you can find some identifying factor in those file formats.

You could also search Google to try and find a library that does this classification, or look at the source code of the file command.

Brian R. Bondy
Thanks, will look into this.
If you want to handle this in code, then yes, your only option is to look at the bytes and figure out what the file type is based on that. Most files have some kind of header in the first few bytes describing the data, format, etc.
Justin
+1  A: 

You may be interested in this: http://en.wikipedia.org/wiki/Magic_number_(programming)

Most binary formats contain a magic number at their beginning. If you only have to recognize a certain set of formats, it should be easy to check the first few bytes of a new incoming file and guess the appropriate file extension correctly.

xor_eq
Thanks, but the magic number seems to be the same across all MS Office files (doc, xls, rtf). I need to differentiate between these as well.
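
One possible way around the shared signature, sketched here as a heuristic and not as a full parser: binary .doc and .xls files are OLE2 compound files whose directory stores stream names as UTF-16LE text. Word documents contain a stream named "WordDocument" and Excel (BIFF8) workbooks contain one named "Workbook", so scanning the bytes for those names can separate the two. RTF is plain text beginning with "{\rtf", so it can be told apart from the leading bytes alone. The class and method names below are hypothetical, and the scan can be fooled by files that embed objects of the other type.

```csharp
using System;
using System.Text;

public static class Ole2Sniffer
{
    // Naive byte-pattern search; returns the first match index or -1.
    private static int IndexOf(byte[] haystack, byte[] needle)
    {
        for (int i = 0; i <= haystack.Length - needle.Length; i++)
        {
            int j = 0;
            while (j < needle.Length && haystack[i + j] == needle[j]) j++;
            if (j == needle.Length) return i;
        }
        return -1;
    }

    // Heuristic: look for the UTF-16LE stream names used by Word and Excel.
    // Returns ".doc", ".xls", or null when neither name is found.
    public static string GuessOfficeExtension(byte[] fileBytes)
    {
        // Encoding.Unicode is UTF-16LE, the encoding of OLE2 directory entry names.
        if (IndexOf(fileBytes, Encoding.Unicode.GetBytes("WordDocument")) >= 0) return ".doc";
        if (IndexOf(fileBytes, Encoding.Unicode.GetBytes("Workbook")) >= 0) return ".xls";
        return null;
    }
}
```

This would only be called after the leading bytes have already identified the file as an OLE2 compound file. A stricter approach would walk the compound-file directory and check the stream names at the root level only.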