I need to convert a PDF file to a TXT file (or DOC, but I prefer TXT) in C#. Can someone tell me how to do it? I saw something about it when I searched Google, but I didn't understand where I should put the file. Can someone tell me which references I need, which files I need to add and where, and what the code for the conversion is?

A: 

Converting PDF to text is not really straightforward, and you won't see anyone posting code here that converts PDF to text directly. So your best bet is to use a library that does the job for you. A good one is PDFBox; you can google it. You'll probably find it written in Java, but fortunately you can use IKVM to convert it to .NET.

Red Serpent
+1  A: 

I've had the need myself and I used this article to get me started: http://www.codeproject.com/KB/string/pdf2text.aspx

Don
A: 

Ghostscript could do what you need. Below is a command for extracting text from a pdf file into a txt file (you can run it from a command line to test if it works for you):

gswin32c.exe -q -dNODISPLAY -dSAFER -dDELAYBIND -dWRITESYSTEMDICT -dSIMPLE -c save -f ps2ascii.ps "test.pdf" -c quit >"test.txt"

For details on how to use Ghostscript with C#, see the CodeProject article "Convert PDF to Image Using Ghostscript API": http://www.codeproject.com/KB/cs/GhostScriptUseWithCSharp.aspx

serge_gubenko
Thanks!!! It's working, but there is a problem: it's not saving to the txt file. It just creates the file and it remains empty. Why doesn't it work? I ran it like this: C:\>C:\gswin32.exe -q -dNODISPLAY -dSAFER -dDELAYBIND -dWRITESYSTEMDICT -d -c save -f ps2ascii.ps "C:\New Folder\2\test.pdf" -c quit >"c:\test.txt"
If you run it like this: gswin32.exe "C:\New Folder\2\test.pdf", does it show you the file? You might also want to try running it from the bin folder of gs, something like this: C:\Program Files\gs\gs8.64\bin>gswin32c.exe .... In any case, gs should give you an error if it can't find or parse your file. Please post the error here if you still have no luck converting your file.
serge_gubenko
I tried: C:\Program Files\gs\gs8.64\bin>gswin32.exe "C:\New Folder\2\test.pdf" and the program told me that it can't parse the file (but it showed me the PDF file), which is weird, because when I ran gswin32.exe -q -dNODISPLAY -dSAFER -dDELAYBIND -dWRITESYSTEMDICT -dSIMPLE -c save -f ps2ascii.ps "c:\test.pdf" > "c:\test.txt" it did convert it. The only problem is that it creates the file but doesn't write into it. Is this supposed to work on Windows?
It has to work on Windows, and it works fine for me. There could be problems with parsing PDF files, but usually you get an error message from gs explaining what is missing or broken. Can you post your PDF file somewhere on a file sharing service so I can try converting it?
serge_gubenko
http://www.megafileupload.com/en/file/170875/test-pdf.html is the link to the file I want to convert. I don't think you will have a problem converting it. I succeeded in converting it, but the problem is that it's not saving to the txt file. Here is the command again: gswin32.exe -q -dNODISPLAY -dSAFER -dDELAYBIND -dWRITESYSTEMDICT -dSIMPLE -c save -f ps2ascii.ps "c:\test.pdf" > "c:\test.txt"
Tested your file and it worked fine. The problem is the executable you're using, which is gswin32.exe, whereas you have to use gswin32c.exe (c == console). Here's how I called it: gswin32.exe -q -dNODISPLAY -dSAFER -dDELAYBIND -dWRITESYSTEMDICT -dSIMPLE -c save -f ps2ascii.ps "c:\test.pdf" -c quit >"c:\test.txt"
serge_gubenko
Oops, sorry, that should be: gswin32c.exe -q -dNODISPLAY -dSAFER -dDELAYBIND -dWRITESYSTEMDICT -dSIMPLE -c save -f ps2ascii.ps "c:\test.pdf" -c quit >"c:\test.txt"
serge_gubenko
Wow!! It works!! Thanks!!! But there is still a tiny problem: if there is a bold word, then in some PDF files it is not parsed right, and the word is cut in the middle or every word is separated. Is there something I can do about that? I uploaded an example file. You can see it clearly in the first line, but there are some other words like that in the other lines (where there was a bold line): http://www.megafileupload.com/en/file/170969/test-txt.html And another question: I need to convert 15000 PDF files (for my project). Is it OK if I write a loop in C# and run this program for each file from cmd?
Regarding the 15000 PDF files: check the link I gave you in the original reply, http://www.codeproject.com/KB/cs/GhostScriptUseWithCSharp.aspx, for details on how you can use gsdll32.dll in your C# project. 15k files is a lot but shouldn't be a problem for gs; besides, you never said whether that is a total number or whether you're going to receive that many, for instance, per hour. As an alternative, you can run 2..n instances of gswin32c.exe in parallel from different threads and point them to different files from your set; this shouldn't require a lot of coding to implement. I'll take a look at the file...
serge_gubenko
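The "2..n instances in parallel" idea might look roughly like the sketch below. It is only an illustration under assumptions: `BatchConverter`, `OutputPathFor`, `ConvertAll`, and the `convertOne` delegate are hypothetical names, and the delegate stands in for whatever code starts gswin32c.exe for a single file. The degree of parallelism is capped so that only a few Ghostscript processes run at once.

```csharp
using System;
using System.Collections.Concurrent;
using System.Collections.Generic;
using System.IO;
using System.Threading.Tasks;

static class BatchConverter
{
    // Maps an input path to its output path: test.pdf -> test.txt, same folder.
    public static string OutputPathFor(string pdfPath)
    {
        return Path.ChangeExtension(pdfPath, ".txt");
    }

    // Calls convertOne(pdfPath, txtPath) for every input file, running at most
    // maxParallel conversions at the same time. Returns the inputs whose
    // conversion threw an exception.
    public static List<string> ConvertAll(
        IEnumerable<string> pdfPaths,
        Action<string, string> convertOne,   // hypothetical: runs gs for one file
        int maxParallel)
    {
        var failed = new ConcurrentBag<string>();
        var options = new ParallelOptions { MaxDegreeOfParallelism = maxParallel };

        Parallel.ForEach(pdfPaths, options, pdf =>
        {
            try
            {
                convertOne(pdf, OutputPathFor(pdf));
            }
            catch (Exception)
            {
                // Keep going: one bad PDF should not stop the whole batch.
                failed.Add(pdf);
            }
        });

        return new List<string>(failed);
    }
}
```

For a folder of PDFs, the input list could come from `Directory.EnumerateFiles(folder, "*.pdf")`. A plain `foreach` loop over the same list would also work if a single instance at a time is fast enough.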
Sorry, I misunderstood your question about whether it's OK to run the program from cmd for your whole set of files. Yes, I don't see any problem with it; it should work fine. Regarding word separation: I don't think gs would be able to remove those, but I guess you can post-process the txt file afterwards and remove them in your application.
serge_gubenko
OK. Thanks a lot! You really helped me!! I'll make a program that calls gs from C#. There is no need for what was said in your link, because I can execute the cmd command from C#, so I'll just make a loop. And the time is OK; it can take 24 hours, I don't care. I can't post-process the txt files, there are a lot of them... Anyway, thanks!!!
About the word separation: I don't want to remove them, they are important. I want them in normal form (unseparated).
I didn't succeed in running it through C# or Java. Is there an automatic way to run it with the parameters you gave me and change the input and output files?
check this thread for details on how you can run gswin32c.exe with parameters from your c# application: http://stackoverflow.com/questions/1941118/asp-converting-pdf-to-a-collection-of-images-on-the-server-using-ghostscript/1944348#1944348
serge_gubenko
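As a rough sketch of running gswin32c.exe with parameters from C#, something like the following could work. Note that the `>` in the command line is a cmd.exe feature, so this version captures standard output itself and writes it to the text file. The names `GsText`, `BuildArguments`, and `ConvertPdf` are hypothetical, the flags are copied from the command earlier in this thread, and the path to gswin32c.exe is an assumption you pass in. It also assumes ps2ascii.ps can be found by Ghostscript, as it was on the command line.

```csharp
using System;
using System.Diagnostics;
using System.IO;

static class GsText
{
    // Builds the same arguments as the command line used in this thread,
    // minus the ">" redirection, which only cmd.exe understands.
    public static string BuildArguments(string pdfPath)
    {
        return "-q -dNODISPLAY -dSAFER -dDELAYBIND -dWRITESYSTEMDICT -dSIMPLE "
             + "-c save -f ps2ascii.ps \"" + pdfPath + "\" -c quit";
    }

    // Runs the console build of Ghostscript and saves its standard output
    // to txtPath. gsExe is an assumption: the full path to gswin32c.exe.
    // Returns the process exit code.
    public static int ConvertPdf(string gsExe, string pdfPath, string txtPath)
    {
        var psi = new ProcessStartInfo
        {
            FileName = gsExe,
            Arguments = BuildArguments(pdfPath),
            UseShellExecute = false,          // required to redirect output
            RedirectStandardOutput = true,
            CreateNoWindow = true
        };

        using (var process = Process.Start(psi))
        {
            // Read all output before waiting, so a full pipe cannot block gs.
            string text = process.StandardOutput.ReadToEnd();
            process.WaitForExit();
            File.WriteAllText(txtPath, text);
            return process.ExitCode;
        }
    }
}
```

With the paths mentioned in this thread, a call might look like `GsText.ConvertPdf(@"C:\Program Files\gs\gs8.64\bin\gswin32c.exe", @"c:\test.pdf", @"c:\test.txt");`. Changing the input and output files is then just a matter of passing different arguments inside a loop.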