Hello!

This PDF file contains the mark list of a certain exam: http://www.megaupload.com/?d=T9VM6P9E

I am particularly interested in the first list, which unfortunately has 2112 entries that aren't properly formatted. I need to sort all these entries by the sum of the marks in the last two columns (Aptitude and Computer) to find out what my rank is.

I tried copying it into MS Word and Excel, but if you try it, you can see that it doesn't help. After pasting it into a plain text file, I tried to format it using regular expressions (in Notepad++) and wrote a C program to separate each field with '\t' (so that I could later copy the fields into an Excel sheet), but the inconsistencies defeated me: some entries span multiple lines, and the names do not have a fixed number of fields.

Can someone come up with an idea that would make it possible to copy the first list from the PDF into a spreadsheet in tabular form, exactly as in the original file?

I desperately need to sort this; any help would be highly appreciated. :)

A: 

I was once tasked with building a parser that would extract data from a PDF with tabular and non-tabular data in a number of different encodings and with a mix of RTL and LTR text. That project took quite some effort, but with a simple English table you should be able to dissect the PDF in no time. Look for the PDF specs on adobe.com and, if you are that desperate, start digging in.

Also, you'll first need to use pdftk.exe to uncompress the file.

A shortcut that may be of aid: http://www.adobe.com/devnet/pdf/pdf_reference.html

This is the shortcut I meant: http://www.codeproject.com/KB/cs/PDFToText.aspx

ondesertverge
Thanks ondesertverge, but would you be able to describe the exact procedure? I'm not very familiar with the PDF file format and such... I tried pdftk to uncompress it, but it said it couldn't open the PDF file. I was busy doing all I could to sort the list, so I didn't get much time to read the documentation. Will look at it later. Thanks anyway. :)
Ninad
A: 

Well, I sort of managed it. I first copied it to a plain text file and deleted all the letters from it, leaving only the serial numbers and the corresponding marks, separated by spaces or tabs. Then, using "import" in an OpenOffice spreadsheet, I told it that the delimiters are spaces and tabs (combining them if necessary), and bingo! I got my rank.
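For anyone who would rather script the same steps than go through the spreadsheet import, here is a minimal Python sketch. It is not what I actually ran. It assumes the text has already been reduced to lines that start with a serial number and end with the two marks (Aptitude, then Computer), separated by any mix of spaces and tabs. The names `parse_row` and `rank_rows` are made up for illustration, and giving tied totals the same rank is my own assumption.

```python
def parse_row(line):
    """Return (serial, total) for a numeric line, or None if the line does not fit."""
    fields = line.split()  # splits on any run of spaces and tabs
    if len(fields) < 3:
        return None
    try:
        serial = int(fields[0])
        aptitude = float(fields[-2])  # assumed: second-to-last column
        computer = float(fields[-1])  # assumed: last column
    except ValueError:
        return None  # leftover text or a broken line: skip it
    return serial, aptitude + computer


def rank_rows(text):
    """Return (rank, serial, total) tuples, highest total first; ties share a rank."""
    rows = [row for row in map(parse_row, text.splitlines()) if row is not None]
    rows.sort(key=lambda row: row[1], reverse=True)
    ranked = []
    for position, (serial, total) in enumerate(rows, start=1):
        if ranked and ranked[-1][2] == total:
            rank = ranked[-1][0]  # same total as the previous entry: same rank
        else:
            rank = position
        ranked.append((rank, serial, total))
    return ranked
```

Calling `rank_rows` on the contents of the cleaned-up text file and looking up your own serial number in the result gives the rank. Lines that do not parse are skipped silently, so it is worth checking that the number of returned rows matches the number of entries in the list.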

But I would still like to know whether it's possible to copy the whole table as it is, so I'm keeping this question open.

Ninad
Is this a one-time deal, or do you want to build a tool to do this on a regular basis?
ondesertverge