How to extract text from MS office documents in C#

+2 Q:

I am trying to extract text (as a string) from MS Word (.doc, .docx), Excel, and PowerPoint files using C#. Where can I find a free and simple .NET library to read MS Office documents? I tried NPOI, but I couldn't find a sample showing how to use it.

I did a docx text extractor once, and it was very simple. Basically a docx file (and, I presume, the other new formats) is a zip file containing a bunch of XML files. The text can be extracted using an XmlReader and nothing but .NET classes.

I don't have the code anymore, it seems :(, but I found a guy who has a similar solution.

Maybe this isn't viable for you if you need to read .doc and .xls files though, since they are binary formats and probably much harder to parse.

There is also the Open XML SDK released by Microsoft, though it is still in CTP.
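
The zip-plus-XML approach described above can be sketched roughly as follows. This is not the lost original code. It assumes .NET 4.5 or later (for System.IO.Compression.ZipArchive, which was not available when this was posted) and assumes the text lives in the conventional word/document.xml part. The names DocxText and Extract are made up for illustration.

```csharp
using System;
using System.IO;
using System.IO.Compression;
using System.Text;
using System.Xml;

public static class DocxText
{
    // WordprocessingML namespace used by the elements in word/document.xml.
    private const string W =
        "http://schemas.openxmlformats.org/wordprocessingml/2006/main";

    // Reads the main document part of a .docx stream and returns its text,
    // with a '\n' after every non-empty paragraph element.
    public static string Extract(Stream docx)
    {
        var text = new StringBuilder();
        using (var zip = new ZipArchive(docx, ZipArchiveMode.Read, true))
        {
            // Assumption: the main part is at its conventional location.
            ZipArchiveEntry entry = zip.GetEntry("word/document.xml");
            if (entry == null)
                throw new InvalidDataException("word/document.xml not found.");

            using (Stream part = entry.Open())
            using (XmlReader reader = XmlReader.Create(part))
            {
                bool inText = false; // true while inside a <w:t> element
                while (reader.Read())
                {
                    switch (reader.NodeType)
                    {
                        case XmlNodeType.Element:
                            if (reader.NamespaceURI == W && reader.LocalName == "t"
                                && !reader.IsEmptyElement)
                                inText = true;
                            break;
                        case XmlNodeType.Text:
                        case XmlNodeType.Whitespace:
                        case XmlNodeType.SignificantWhitespace:
                            if (inText) text.Append(reader.Value);
                            break;
                        case XmlNodeType.EndElement:
                            if (reader.NamespaceURI != W) break;
                            if (reader.LocalName == "t") inText = false;
                            else if (reader.LocalName == "p") text.Append('\n');
                            break;
                    }
                }
            }
        }
        return text.ToString();
    }
}
```

Typical use would be something like `using (var fs = File.OpenRead("report.docx")) Console.WriteLine(DocxText.Extract(fs));`, where report.docx is a placeholder file name. Tabs, line breaks inside a paragraph, headers, footers and footnotes are ignored by this sketch.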

Skurmedel 2009-06-18 07:25:27

This is really great! I am done with docx, but what about the rest?

Elias Haileselassie 2009-06-18 09:22:22

I think you can "connect" to an xlsx file with ODBC as if it were a database, though that seems a rather cumbersome solution. I have no idea how to read .doc or .xls files, so I can't help you there. Here is a reference for .xls files though: http://sc.openoffice.org/excelfileformat.pdf

Skurmedel 2009-06-18 10:32:34

I couldn't find anything better on XLSX than the specification itself sadly: http://www.ecma-international.org/publications/files/ECMA-ST/Office%20Open%20XML%201st%20edition%20Part%201%20(PDF).zip

Skurmedel 2009-06-18 10:37:59

+1 A:

Simple!

These two steps will get you there:

1) Use the Office Interop library to convert DOC to DOCX
2) Use DOCX2TXT to extract the text from the new DOCX

The link for 1) has a very good explanation of how to do the conversion and even a code sample.

An alternative to 2) is to just unzip the DOCX file in C# and scan for the files you need. You can read about the structure of the ZIP file here.
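
A minimal sketch of the "unzip and scan" alternative, assuming .NET 4.5 or later for ZipArchive. The part names checked here are the conventional locations of text in .docx, .xlsx and .pptx packages. The package's relationship files are the authoritative source, so treat this as a shortcut. OfficeZipScan and FindTextParts are illustrative names, not part of any library.

```csharp
using System;
using System.Collections.Generic;
using System.IO;
using System.IO.Compression;
using System.Linq;

public static class OfficeZipScan
{
    // Returns the names of the zip entries that conventionally hold the
    // visible text of a .docx, .xlsx or .pptx package, sorted ordinally.
    public static List<string> FindTextParts(Stream package)
    {
        using (var zip = new ZipArchive(package, ZipArchiveMode.Read, true))
        {
            return zip.Entries
                .Select(e => e.FullName)
                .Where(IsTextPart)
                .OrderBy(n => n, StringComparer.Ordinal)
                .ToList();
        }
    }

    private static bool IsTextPart(string name)
    {
        return name == "word/document.xml"        // Word: main document body
            || name == "xl/sharedStrings.xml"     // Excel: shared cell strings
            || (name.StartsWith("ppt/slides/slide", StringComparison.Ordinal)
                && name.EndsWith(".xml", StringComparison.Ordinal)); // PowerPoint slides
    }
}
```

Each returned entry can then be opened with `ZipArchiveEntry.Open()` and read with an XmlReader. Note that the ordinal sort puts slide10.xml before slide2.xml, and that the real slide order is defined in the presentation part rather than by file name.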

Edit: Ah yes, I forgot to point out as Skurmedel did below that you must have Office installed on the system on which you want to do the conversion.

joshcomley 2009-06-18 07:38:03

Only sad part with the Office interop library is that you need to have Office installed.

Skurmedel 2009-06-18 07:40:36

+5 A:

Using P/Invoke you can use the IFilter interface (on Windows). The IFilters for many common file types are installed with Windows (you can browse them using this tool). You can just ask the IFilter to return the text from the file. There are several sets of example code (here is one such example).

adrianbanks 2009-06-18 08:28:28

Interesting... a very sneaky solution :)

Skurmedel 2009-06-18 09:05:01

Not really. It's the mechanism used by the indexing service on Windows, and I think the desktop search also uses it. I've used it to index PDFs (by installing the Adobe IFilter - http://www.adobe.com/support/downloads/detail.jsp?ftpID=2611), all types of Office documents (the IFilters for these come installed with Windows) and several other file types. When it works, it works well. Occasionally though, you get no text back from the IFilter, and no reason as to why.

adrianbanks 2009-06-18 11:03:45

Elias Haileselassie 2009-06-19 08:25:54

Does this solution work on PDF docs as well?

2010-02-22 16:40:10

Yes, as long as you install the PDF IFilter. You can do this by installing Acrobat Reader (the IFilter gets installed with it), or by installing the IFilter separately (http://www.adobe.com/support/downloads/detail.jsp?ftpID=4025). [Note: other PDF IFilters are available :)]

adrianbanks 2010-02-22 17:15:49

Two quick questions. a) I am currently using the method outlined here - http://www.codeproject.com/KB/cs/PDFToText.aspx - to extract text from PDF. In what way would using IFilters be any different? b) In the IFilter method you linked, the author does: TextReader reader = new FilterReader(fileName); I am using the FileUpload control in ASP.NET and cannot get the path to the file, as this is not exposed on the server side for security reasons. On the server side I can only do the following with the FileUpload control: Stream str = fileUpload1.FileContent; byte[] b = fileUpload1.FileBytes;

2010-02-22 17:31:09

@user102533: a) The only real difference is that using the IFilter gives you a generic method of extracting the text from any supported file type. Using PDFToText is specific to that library, and to PDF files. If you only need to do it for PDF files though, it doesn't make much difference (and might be better, as the Adobe IFilter is a bit temperamental). b) IFilters work by you passing them a filename. What I've done in the past is to save the byte[] to a temporary file and then pass its filename to the IFilter.
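
The temporary-file workaround from b) might look roughly like this. It is a sketch under assumptions: TempFileBridge and ExtractViaTempFile are hypothetical names, and the path-based extractor (for example a wrapper around the FilterReader class from the linked example) is passed in as a delegate rather than implemented here.

```csharp
using System;
using System.IO;

public static class TempFileBridge
{
    // Writes the uploaded bytes to a temporary file, hands its path to a
    // path-based extractor, and deletes the file afterwards.
    public static string ExtractViaTempFile(
        byte[] content, string extension, Func<string, string> extractFromPath)
    {
        // Assumption: the extractor picks its filter by file extension, so the
        // temp file keeps the upload's extension instead of a generic ".tmp".
        string path = Path.Combine(
            Path.GetTempPath(), Guid.NewGuid().ToString("N") + extension);

        File.WriteAllBytes(path, content);
        try
        {
            return extractFromPath(path);
        }
        finally
        {
            // Clean up even if the extractor throws.
            File.Delete(path);
        }
    }
}
```

With the FileUpload control from the question, a call could look like `TempFileBridge.ExtractViaTempFile(fileUpload1.FileBytes, ".pdf", p => { using (TextReader r = new FilterReader(p)) return r.ReadToEnd(); })`, where FilterReader comes from the linked example and is not defined in this sketch.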

adrianbanks 2010-02-22 21:59:33
