I want to parse a PDF file from my C# app and create an audio file from it. How would I do that?

I'm particularly looking for a good PDF-to-text library, or some other way to extract the text from a PDF file.

A: 

I guess it's a hard thing to do. First you need to extract the text from the PDF, then use some form of speech synthesis to generate the audio content, and finally store the result as an MP3.

Artur Soler
+2  A: 

You need the Speech SDK from Microsoft. Read the instructions here

jao
+4  A: 

Use Festival for the text-to-speech. Various PDF-to-text APIs exist...

dicroce
A: 

On Mac OS X, you can extract the text of the PDF and then pipe it into "say". You should find equivalent synthesizers on other operating systems.
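A rough sketch of driving "say" from a script, written in Python rather than C# purely for illustration. It assumes the text has already been extracted to a file; the function names and file names are made up for this example. The `-f` (input file), `-o` (output file) and `-v` (voice) options belong to the macOS `say` command.

```python
import shutil
import subprocess


def build_say_command(text_path, audio_path, voice=None):
    """Build the argument list for macOS `say`: read text from a file, write audio to a file."""
    cmd = ["say", "-f", text_path, "-o", audio_path]
    if voice:
        # Optional: pick one of the installed system voices by name.
        cmd += ["-v", voice]
    return cmd


def speak_to_file(text_path, audio_path, voice=None):
    """Run `say`; raises if the tool is missing (i.e. not on macOS) or exits with an error."""
    if shutil.which("say") is None:
        raise RuntimeError("`say` not found; this sketch only works on macOS")
    subprocess.run(build_say_command(text_path, audio_path, voice), check=True)
```

Calling `speak_to_file("book.txt", "book.aiff")` would then write the spoken text to an AIFF file, which can be converted to another format afterwards.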

SnippyHolloW
A: 

Take a look at this SAPI tutorial.

seb
A: 

It's not all that complicated to do, provided that you don't re-invent the wheel, but instead simply reuse existing technology (i.e. text to speech engines like festival), as well as OCR engines to process the PDF files.

The most complicated part is probably dealing with different PDF layouts (columns, rows, embedded graphics, footnotes, URLs, etc.), which can confuse the text recognition process.

However, in general (if this is not supposed to be a learning experience), it is certainly easier to just resort to using existing software solutions:

none
+4  A: 

Ideally, your input is a tagged PDF document. This means that the document contains tags that mark up its logical structure (a typical PDF document contains only visual information).

This PDF could then be converted into DAISY format, which is a standard for digital talking books, i.e. an intermediate XML format storing the text of books along with the logical structure and navigation features.

This DAISY XML can either be converted to an audio format, or you can listen to the book on a DAISY reader, a physical device similar to an MP3 player.

There is a presentation available at the Daisy web site explaining the principles of this toolchain:

Accessible PDF to DAISY/NIMAS Conversion

0xA3
+2  A: 

As the other posters outlined, first you have to extract the text from the .pdf file. PDF is an open format now, so you can probably find a parser through Google.

Then you have to extract the text you want to convert to speech from the file, ignoring things like figure titles, page headers, table of contents etc.
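Here is a crude sketch of that filtering step, in Python for illustration only. It assumes your PDF parser hands you one string per page, and it uses two simple heuristics that are my own assumptions, not a standard algorithm: a line that shows up on most pages is treated as a running header or footer, and a line consisting only of a number (optionally prefixed with "Page") is treated as a page number. The name `clean_pages` and the `min_repeat` parameter are made up for this example.

```python
import re
from collections import Counter

# A line that is only a number, optionally prefixed with "Page".
PAGE_NUMBER = re.compile(r"^\s*(page\s+)?\d+\s*$", re.IGNORECASE)


def clean_pages(pages, min_repeat=0.6):
    """Drop repeated header/footer lines and bare page numbers from per-page text."""
    page_lines = [[ln.strip() for ln in p.splitlines() if ln.strip()] for p in pages]

    # Count on how many pages each distinct line occurs.
    counts = Counter()
    for lines in page_lines:
        counts.update(set(lines))

    # A line must occur on at least two pages, and on roughly min_repeat of all pages.
    threshold = max(2, int(len(pages) * min_repeat))
    repeated = {ln for ln, n in counts.items() if n >= threshold}

    kept = []
    for lines in page_lines:
        for ln in lines:
            if ln in repeated or PAGE_NUMBER.match(ln):
                continue
            kept.append(ln)
    return "\n".join(kept)
```

This will also throw away body lines that happen to be just a number, and it does nothing about figure titles or a table of contents, so real documents will need more rules than this.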

Once you've got the text, you need to convert it to speech. This is probably the hardest part.

A while ago I was fiddling around with generating voice files for a gaming mod, since I'm a rotten voice actor.

Cepstral had the best TTS converters I could find. (The free ones had an annoying tendency to insert Cepstral advertisements in the speech, but I could manually edit this out for what I was doing.)

It turns out that there's a Speech Synthesis Markup Language (SSML) which can be used to give the TTS converter hints about which syllables to stress, etc. Here's a linky:

http://www.w3.org/TR/speech-synthesis/

How you go about automatically adding the SSML to the text is a bit beyond me.
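For what it's worth, the structural part of the markup is easy to generate. The sketch below (Python, illustration only) wraps plain text in the `<speak>`, `<p>` and `<s>` elements from the SSML spec linked above, assuming paragraphs are separated by blank lines and sentences end in ".", "!" or "?". The function name `text_to_ssml` is made up, and the sentence splitting is deliberately naive (it will trip over abbreviations). It does not attempt the hard part, i.e. deciding where emphasis or pronunciation hints should go.

```python
import re
from xml.sax.saxutils import escape


def text_to_ssml(text, lang="en-US"):
    """Wrap plain text in minimal SSML: one <p> per paragraph, one <s> per sentence."""
    paragraphs = [p.strip() for p in re.split(r"\n\s*\n", text) if p.strip()]
    parts = [
        '<?xml version="1.0"?>',
        '<speak version="1.0" xmlns="http://www.w3.org/2001/10/synthesis" xml:lang="%s">' % lang,
    ]
    for para in paragraphs:
        parts.append("  <p>")
        # Collapse line breaks inside the paragraph, then split after ., ! or ?
        flat = " ".join(para.split())
        for sentence in re.split(r"(?<=[.!?])\s+", flat):
            # escape() protects &, < and > so the result stays well-formed XML.
            parts.append("    <s>%s</s>" % escape(sentence))
        parts.append("  </p>")
    parts.append("</speak>")
    return "\n".join(parts)
```

Whether a given TTS engine accepts SSML input, and which elements it honours, depends on the engine.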

Anyway, the TTS converter will produce an audio file, and the final step would be to compress the audio at the desired bit rate in mp3 format.
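As an illustration of that last step, here is a small Python sketch that shells out to the LAME command-line encoder, assuming `lame` is installed and the TTS output is a WAV file. The `-b` option sets the bit rate in kbps; the function names and the 64 kbps default are just choices for this example.

```python
import shutil
import subprocess


def build_lame_command(wav_path, mp3_path, bitrate_kbps=64):
    """Build the argument list for `lame`: encode a WAV file to MP3 at a constant bit rate."""
    return ["lame", "-b", str(bitrate_kbps), wav_path, mp3_path]


def encode_mp3(wav_path, mp3_path, bitrate_kbps=64):
    """Run the encoder; raises if `lame` is not installed or exits with an error."""
    if shutil.which("lame") is None:
        raise RuntimeError("`lame` not found on PATH")
    subprocess.run(build_lame_command(wav_path, mp3_path, bitrate_kbps), check=True)
```

Speech usually survives much lower bit rates than music, so it is worth experimenting with the value.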

billmcc
+2  A: 

If all you want is to listen to speech-synthesized text from a PDF, how about Acrobat's "Read Out Loud" function at the bottom of the "View" menu?

spender