I've just got my hands on a Stack Overflow data dump, and I'm disappointed to see that the Body field of each post is in HTML rather than Markdown. I suspect the original database stores Markdown, because that's what I see when I try to edit an answer.

I want to recover Markdown from a large set of answers. I will be processing hundreds of entries in batch mode, using either command-line tools or some kind of Lua or C library, so an interactive tool like the wmd Markdown editor is not suitable. What tools are available to help me recover Markdown from a Stack Overflow data dump?


(Related question, not a duplicate: Convert HTML back to Markdown within wmd.)

+4  A: 

Markdownify, a PHP library, converts HTML to Markdown.

See Also: MetaSO / Can Markdown be recovered from the SO data dump?

Jonathan Sampson
Norman should know, he asked that question too! :)
Andrew Keeton
When it comes to using PHP on the command line, I am a troglodyte. I can't figure out from the manual whether there is a library function to read the entire contents of a file. Is dio_read(STDIN) on the right track?
Norman Ramsey
If you want to read the contents of a file, there are many ways. A simple function that does it is `file_get_contents()`.
Jonathan Sampson
+2  A: 

Take a look at pandoc: http://johnmacfarlane.net/pandoc/

There is an html2markdown tool included with pandoc that works pretty well, and because it runs from the command line, batch conversion is straightforward.

Here is the man page: http://johnmacfarlane.net/pandoc/html2markdown.1.html
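For batch use, a rough sketch of driving pandoc from a script might look like the following. It assumes pandoc is installed and on the PATH, and it calls pandoc directly with `--from html --to markdown` rather than going through the html2markdown wrapper. The function names (`pandoc_command`, `html_to_markdown`, `convert_all`) and the id-to-body mapping are made up for illustration; how you pull the Body fields out of the dump is left to you.

```python
import shutil
import subprocess


def pandoc_command(markdown_flavor="markdown"):
    """Build the argv that converts HTML on stdin to Markdown on stdout."""
    return ["pandoc", "--from", "html", "--to", markdown_flavor]


def html_to_markdown(html, markdown_flavor="markdown"):
    """Convert one HTML fragment by piping it through pandoc."""
    if shutil.which("pandoc") is None:
        # Assumption: pandoc is a separate install; fail loudly if missing.
        raise RuntimeError("pandoc not found on PATH")
    result = subprocess.run(
        pandoc_command(markdown_flavor),
        input=html,
        capture_output=True,
        text=True,
        check=True,  # raise CalledProcessError if pandoc exits non-zero
    )
    return result.stdout


def convert_all(bodies):
    """Map post id -> HTML body into post id -> Markdown (hypothetical shape)."""
    return {post_id: html_to_markdown(body) for post_id, body in bodies.items()}
```

Calling `convert_all({42: "<p>Some <em>HTML</em></p>"})` would run pandoc once per post. That is simple but slow for very large dumps, so batching several bodies per invocation may be worth trying. The output is pandoc's reconstruction of the Markdown, not necessarily the text the author originally typed.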

Mica
Looks awesome! I will definitely check it out.
Norman Ramsey