How do components such as the Total Commander search manage to open every file format and search inside it? Is there a free library that offers such a feature? Ultimately, I would like to extract text from files and support all formats (PDF, Microsoft Word .doc, CHM, …).

A: 

The programs that seem to do so actually don't. They delegate the task to extractors installed on your system. If you do not have an extractor for the .foo file format, no program will be able to search inside .foo files.

This is of course no surprise once you realize that there's no way another program can know how I stored text in .MyOwnFormat files.
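
To illustrate the delegation idea, here is a minimal sketch of an extractor registry in Python. It is not how Total Commander or any real search tool is implemented; the names `EXTRACTORS`, `register`, and `extract_text` are hypothetical and exist only for this example.

```python
from pathlib import Path
from typing import Callable, Dict, Optional

# Hypothetical registry: file extension -> function that turns raw bytes into text.
EXTRACTORS: Dict[str, Callable[[bytes], str]] = {}


def register(ext: str):
    """Decorator that installs an extractor for one file extension."""
    def decorator(fn: Callable[[bytes], str]) -> Callable[[bytes], str]:
        EXTRACTORS[ext.lower()] = fn
        return fn
    return decorator


@register(".txt")
def extract_txt(data: bytes) -> str:
    # Plain text needs no real extraction, only decoding.
    return data.decode("utf-8", errors="replace")


def extract_text(filename: str, data: bytes) -> Optional[str]:
    """Return the extracted text, or None if no extractor is installed."""
    extractor = EXTRACTORS.get(Path(filename).suffix.lower())
    if extractor is None:
        return None  # nobody knows how .foo stores its text
    return extractor(data)
```

With only the `.txt` extractor registered, `extract_text("notes.txt", b"hello")` yields the text, while `extract_text("data.foo", b"...")` returns `None`, which is the "no program will be able to" case.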

MSalters
So again, how does Total Commander do it? Does it use "extractors"? Maybe IFilter?
IFilter is Microsoft Windows' interface for such extractors. I assume that's how Total Commander works on Windows.
MSalters
A: 

I believe Total Commander actually treats all these files as plain text (maybe with some codepage guessing, or by simply trying all codepages). For example, if you look closely at a .doc file as plain text, you'll find its text among the binary data, which is sufficient for searching. Also, some kind of archive detection routine is almost certainly used, because MS Office 2007 and OpenOffice use ZIP to compress their files, and it's useless to search for text in a compressed file without unpacking it.
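
A rough sketch of that approach in Python, assuming a small hand-picked list of encodings: encode the search string in each one, look for any of the resulting byte sequences in the raw file, and unpack ZIP containers first. This is only a guess at the general technique, not Total Commander's actual code; `ENCODINGS`, `raw_contains`, and `contains_text` are names made up for the example.

```python
import io
import zipfile

# Assumed candidate encodings; a real tool would likely try many more.
ENCODINGS = ("utf-8", "utf-16-le", "cp1252", "cp1251")


def needle_variants(needle: str) -> set:
    """Encode the search string in every candidate encoding that can represent it."""
    variants = set()
    for encoding in ENCODINGS:
        try:
            variants.add(needle.encode(encoding))
        except UnicodeEncodeError:
            pass  # this codepage cannot represent the string
    return variants


def raw_contains(data: bytes, needle: str) -> bool:
    """Plain substring search over raw bytes, one pass per encoded variant."""
    return any(variant in data for variant in needle_variants(needle))


def contains_text(data: bytes, needle: str) -> bool:
    """Search raw bytes, unpacking ZIP containers (e.g. .docx, .odt) first."""
    if zipfile.is_zipfile(io.BytesIO(data)):
        with zipfile.ZipFile(io.BytesIO(data)) as archive:
            return any(
                raw_contains(archive.read(name), needle)
                for name in archive.namelist()
            )
    return raw_contains(data, needle)
```

Note that this only answers "does the string occur somewhere in the bytes"; it does not recover the document's text.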

n0rd
IIRC it's not sufficient for any serious use. That is, the string "foo bar" will only be found if there is no change in formatting within it. As a result, "foo _bar_ " won't be found.
MSalters
That depends on the implementation, actually. This exact case has quite an easy workaround. I am pretty sure that Total Commander just searches for a substring (taking into account different possible codepages). But searching for text in files and extracting the exact document text are quite different tasks.
n0rd
Assuming that the TC I am talking about (this one: http://www.ghisler.com/ ) is the same one the topic starter is talking about.
n0rd
Also, MS Word in particular seems to store text formatting as information separate from the text itself. So "foo *bar* " is stored as "foo bar".
n0rd