I want to parse HTML, PDF, CSV and text files. Which of these file types is the easiest and most efficient to parse?

I am asking because I want to parse PDF, HTML, CSV and text files through common parsing code, if possible.

Now suppose HTML turns out to be the easiest and most efficient to parse. Then:

I will write the parsing code for HTML files and try to convert the PDF files to HTML (if possible), so that the code written for parsing HTML also works for PDF.

In the same way, I will try to convert PDF, CSV and text files to HTML and write parsing code for HTML only, so that this one piece of code parses HTML, PDF, CSV and text files.

So: (1) Which type of file is the easiest and most efficient to parse (PDF, CSV, HTML, text)? (2) Is it possible to convert these files (PDF, text, HTML, CSV) to each other? For example, if HTML is the easiest to parse, then PDF to HTML, text to HTML and CSV to HTML.

A: 

Just look at the files in a text editor.

Should be plainly obvious which one is going to be easiest.

ck
A: 

HTML, CSV and text parsing are all equally easy; I can't say which is most appropriate without knowing what data you are hoping to parse. There is no difficulty in converting between them as long as you know what you want to do.

PDF is another ball game: it's going to be much harder, and it will involve a third-party library to extract the text from it first.

Paul Creasey
@Paul Creasey. From every type of file I want to extract table information.
Harikrishna
+3  A: 

You cannot parse all of the above file types with the same parser code.

The simplest format is text; CSV and HTML are text files too. Having said that, it doesn't mean that they are simple to parse. It really depends on what formatting they have.

PDF files are binary in nature, so will require a different parser.

In general, the more structured the data, the easier the parsing (so, CSV would be easiest and probably fastest).
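
To illustrate why structure helps, here is a minimal sketch in Python (used only for brevity; the libraries named in this answer are .NET ones). It relies on the standard `csv` module, and the function name `parse_csv_table` and the sample data are made up for the example.

```python
import csv
import io


def parse_csv_table(text):
    """Return the rows of a CSV document as a list of lists of strings."""
    # csv.reader deals with quoted fields and embedded commas;
    # the `if row` filter drops completely empty lines.
    return [row for row in csv.reader(io.StringIO(text)) if row]


# Hypothetical sample: the second row has a quoted field containing a comma.
sample = 'name,price\n"Widget, large",9.50\nGadget,3\n'
rows = parse_csv_table(sample)
```

Because the format already says where each row and cell begins and ends, the whole "parser" is a couple of lines. HTML and free-form text need more work before they yield rows like these.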

I would suggest using an existing parser instead of writing your own.

There are libraries around that will parse CSV and some other types of structured text (tab-delimited, for example) - see FileHelpers.

For HTML parsing there is the HTML Agility Pack.

There are numerous PDF parsers, both free and commercial.

Oded
Of course you can. But the conversion of the PDF requires the PDF to be parsed and the parse tree to be converted to HTML.
Oded
+1. HTML Agility Pack is a great suggestion. Consider the free PdfBox or iTextSharp for PDF. Or for small $ I'm using QuickPDF which is really good.
kenny
@Harikrishna - I don't understand your option 2. How are you converting PDF to HTML? If not in code, then just use an HTML parser after converting the PDFs.
Oded
@Oded Sir, in option 2: should I first convert the PDF to HTML by writing parsing code, and then parse that information with an HTML parser? That is, PDF to HTML and then HTML to text. And in option 1: PDF to text and HTML to text. Which one should I do?
Harikrishna
@Oded Sir, in the end I want text information from every file, which I want to display in a DataGridView. So should I write separate parsing code for each file type, like PDF to text, HTML to text and CSV to text? Or should I first convert PDF to HTML and CSV to HTML, and then HTML to text?
Harikrishna
Any of these options will work. You still need to do the conversions, so it doesn't really matter with respect to performance.
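
One way to picture the "separate parser per format" option is a small dispatcher that sends each file to its own parser and gets the same row structure back. This is only a rough sketch in Python using the standard library; every name in it (`extract_rows`, `rows_from_html` and so on) is hypothetical, the plain-text parser assumes tab-delimited tables, and the PDF branch is left as a stub because it needs a third-party library.

```python
import csv
import io
import os
from html.parser import HTMLParser


class _TableRows(HTMLParser):
    """Collect the cell text of every <tr> into a list of rows."""

    def __init__(self):
        super().__init__()
        self.rows = []
        self._row = None   # cells of the row being read, or None
        self._cell = None  # text pieces of the cell being read, or None

    def handle_starttag(self, tag, attrs):
        if tag == "tr":
            self._row = []
        elif tag in ("td", "th") and self._row is not None:
            self._cell = []

    def handle_data(self, data):
        if self._cell is not None:
            self._cell.append(data)

    def handle_endtag(self, tag):
        if tag in ("td", "th") and self._cell is not None:
            self._row.append("".join(self._cell).strip())
            self._cell = None
        elif tag == "tr" and self._row is not None:
            self.rows.append(self._row)
            self._row = None
            self._cell = None


def rows_from_html(text):
    parser = _TableRows()
    parser.feed(text)
    parser.close()
    return parser.rows


def rows_from_csv(text):
    return [row for row in csv.reader(io.StringIO(text)) if row]


def rows_from_text(text):
    # Assumption: plain-text tables are tab-delimited.
    return [line.split("\t") for line in text.splitlines() if line.strip()]


def rows_from_pdf(text):
    # Stub: real PDF extraction needs a third-party library.
    raise NotImplementedError("PDF needs a third-party library")


PARSERS = {
    ".html": rows_from_html,
    ".csv": rows_from_csv,
    ".txt": rows_from_text,
    ".pdf": rows_from_pdf,
}


def extract_rows(filename, text):
    """Pick a parser by file extension and return rows as lists of strings."""
    ext = os.path.splitext(filename)[1].lower()
    if ext not in PARSERS:
        raise ValueError("unsupported file type: " + ext)
    return PARSERS[ext](text)
```

Whatever displays the data (a grid, for instance) then only has to deal with one shape, a list of rows, regardless of which file the rows came from.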
Oded
@Oded Sir, but which one is the most appropriate, best and easiest?
Harikrishna