Hi,

I tried string[] file = File.ReadAllLines(file_name) to read a Word file.

In debug mode I found that the first few elements of the string array file have values like

"��ࡱ�0\0\0\0>\0\0��\t\0\0\0\0\0". How can I get rid of this?

In some files the first three elements of file[] are filled with these, while in others only the first element contains these unreadable characters.

What is the problem, and how can I get rid of it? My Word file does not even have a blank line at the beginning.

+1  A: 

If you are using .NET 3.5 then I'd suggest that you use a LINQ where clause to return only the lines that you're interested in.

string[] file = File.ReadAllLines(file_name).Where(line => !line.StartsWith("��")).ToArray();

You could also use some form of regular expression instead of the line.StartsWith() method.

Note: If you are reading Microsoft Office Word files, I'd recommend using COM Interop or a third-party library to read the document. You'll find that much easier than trying to parse the file yourself.

Kane
+3  A: 

The problem is you're not opening the file with the correct encoding. Here is a guide to opening and creating Word documents from C#.

Yuriy Faktorovich
+2  A: 

File.ReadAllLines is intended for text files. Word files are not text files. To read Word files you might need a library.
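To see why the first "lines" look like garbage, it can help to check the file's leading bytes rather than decode them as text. The sketch below is only an illustration: the class name WordFileSniffer and the method Sniff are hypothetical, and it assumes the file is either a legacy binary .doc (an OLE2 compound file) or a .docx (a ZIP package). The bytes E0 A1 B1 inside the OLE2 signature happen to decode as "ࡱ" in UTF-8, which is consistent with the output shown in the question.

```csharp
using System;
using System.IO;

static class WordFileSniffer
{
    // Signature of an OLE2 compound file, the container used by legacy binary .doc files.
    static readonly byte[] Ole2Magic = { 0xD0, 0xCF, 0x11, 0xE0, 0xA1, 0xB1, 0x1A, 0xE1 };

    // Signature of a ZIP local file header; a .docx file is a ZIP package.
    static readonly byte[] ZipMagic = { 0x50, 0x4B, 0x03, 0x04 };

    // Hypothetical helper: returns "doc", "docx-or-zip", or "unknown".
    public static string Sniff(string path)
    {
        byte[] header = new byte[8];
        int read;
        using (FileStream fs = File.OpenRead(path))
        {
            // Read the first few bytes as raw binary, with no text decoding.
            read = fs.Read(header, 0, header.Length);
        }

        if (StartsWith(header, read, Ole2Magic)) return "doc";
        if (StartsWith(header, read, ZipMagic)) return "docx-or-zip";
        return "unknown";
    }

    // True when the first `length` bytes of `buffer` begin with `prefix`.
    static bool StartsWith(byte[] buffer, int length, byte[] prefix)
    {
        if (length < prefix.Length) return false;
        for (int i = 0; i < prefix.Length; i++)
        {
            if (buffer[i] != prefix[i]) return false;
        }
        return true;
    }

    static void Main(string[] args)
    {
        // Usage: pass the path of the file to inspect as the first argument.
        Console.WriteLine(Sniff(args[0]));
    }
}
```

If this reports "doc", the file is binary and no text encoding passed to File.ReadAllLines will make it readable. A ZIP signature alone does not prove the file is a Word document, only that it is not plain text.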

Darin Dimitrov
+1  A: 

Word files are not simple text files; they contain additional binary information alongside the text.

If you want to extract the text properly, you should use a library that reads Word documents instead of File.ReadAllLines.

Here are a couple of such libraries.
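If the file happens to be a .docx rather than a legacy .doc, a rough idea of what such a library does can be sketched with the framework alone. This is a minimal sketch under several assumptions: it needs .NET 4.5 or later (for System.IO.Compression.ZipFile), the names DocxText and ReadParagraphs are hypothetical, and it only reads the main document body. It ignores headers, footers, tabs and line breaks, and it will not work on the binary .doc format, which is what the output in the question suggests.

```csharp
using System;
using System.Collections.Generic;
using System.IO;
using System.IO.Compression;
using System.Linq;
using System.Xml.Linq;

static class DocxText
{
    // XML namespace used by WordprocessingML elements such as w:p and w:t.
    static readonly XNamespace W =
        "http://schemas.openxmlformats.org/wordprocessingml/2006/main";

    // Hypothetical helper: returns the text of each paragraph in the main document part.
    public static List<string> ReadParagraphs(string path)
    {
        using (ZipArchive zip = ZipFile.OpenRead(path))
        {
            // The main body of a .docx package lives in word/document.xml.
            ZipArchiveEntry entry = zip.GetEntry("word/document.xml");
            if (entry == null)
            {
                throw new InvalidDataException(
                    "Not a .docx package: word/document.xml is missing.");
            }

            using (Stream stream = entry.Open())
            {
                XDocument xml = XDocument.Load(stream);

                // Each w:p is a paragraph; its visible text is spread over w:t runs.
                return xml.Descendants(W + "p")
                    .Select(p => string.Concat(
                        p.Descendants(W + "t").Select(t => t.Value)))
                    .ToList();
            }
        }
    }

    static void Main(string[] args)
    {
        // Usage: pass the path of a .docx file as the first argument.
        foreach (string line in ReadParagraphs(args[0]))
        {
            Console.WriteLine(line);
        }
    }
}
```

For anything beyond a quick extraction, a dedicated library or COM Interop is still the safer route, since real documents contain tables, text boxes and field codes that this sketch does not handle.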

Oded