ansaurus

Question

Convert plain text to latex code programmatically

Answer 1

+6 A:

\ { } $ ^ _ % ~ # &

are special in LaTeX, so you should make sure to escape them (prefixing with backslash will do for some of them, see Thomas' answer for special cases) or tell your users not to use them unless they deliberately want to use LaTeX commands (or a mix of both, depending on the character).

Some additional pitfalls:

Not every line break in the text might be intended as a new paragraph.
If your users write in a language other than English (or Latin), you will need to load a package that deals with the input encoding (e.g. \usepackage[utf8]{inputenc}) or convert the characters yourself (e.g. ä -> \"a).
As dmckee points out, quotes also need to be treated separately.
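The "convert the characters yourself" option can be sketched in Python. This is a minimal sketch, not a complete solution: the names COMBINING_TO_LATEX and accents_to_latex are made up for illustration, only five common accents are covered, and anything else is passed through unchanged.

```python
import unicodedata

# Combining mark -> LaTeX accent command character (small, incomplete table).
COMBINING_TO_LATEX = {
    "\u0300": "`",   # grave:      \`{a}
    "\u0301": "'",   # acute:      \'{e}
    "\u0302": "^",   # circumflex: \^{o}
    "\u0303": "~",   # tilde:      \~{n}
    "\u0308": '"',   # diaeresis:  \"{a}
}

def accents_to_latex(text):
    """Replace precomposed accented letters with LaTeX accent commands."""
    # Compose first so that "a" + combining mark is seen as one character.
    text = unicodedata.normalize("NFC", text)
    out = []
    for ch in text:
        decomposed = unicodedata.normalize("NFD", ch)
        if len(decomposed) == 2 and decomposed[1] in COMBINING_TO_LATEX:
            base, mark = decomposed
            out.append("\\" + COMBINING_TO_LATEX[mark] + "{" + base + "}")
        else:
            out.append(ch)  # unknown characters are left untouched
    return "".join(out)
```

Run this after escaping the special characters, not before, otherwise the backslashes and braces it produces would be escaped as well.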

EDIT: Since this has become the accepted answer, I also added the points raised in the other answers, so this is now a summary.

Heinzi 2009-11-08 16:47:08

Thanks, didn't think of the unicode stuff!

Noah 2009-11-08 20:57:04

Answer 2

+2 A:

Heinzi has already shown most of the basic characters that need to be escaped, but the hard part here is ensuring that the quoting comes out right.

She said "He didn't do it".

needs to be converted to

She said ``He didn't do it''.

which looks easy in this trivial case, but is full of gotchas that require careful handling. For modest-sized texts, I generally use a naive substitution generated in sed and fix up the results by hand. Things are both easier and harder if your "plain text" uses curly quotes.

Here "naive quote substitution" means that quotes followed by word characters are replaced by (one or two as appropriate) back ticks, and all others are replaced by (one or two) single-quotes ('). That catches most cases in prose, but you will have to clean up all the triple-quote cases by hand.
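A rough Python equivalent of that naive substitution might look like the following. The function name naive_quotes is hypothetical, and there is one assumption beyond the rule as stated: a single quote only counts as an opening quote when it is not preceded by a word character, so that apostrophes as in "didn't" survive.

```python
import re

def naive_quotes(text):
    """Naive straight-quote to LaTeX-quote conversion for prose."""
    # Double quote followed by a word character: opening quote.
    text = re.sub(r'"(?=\w)', "``", text)
    # Every remaining double quote: closing quote.
    text = text.replace('"', "''")
    # Single quote at the start of a word (not inside one): opening quote.
    # Assumption: a preceding word character means it is an apostrophe.
    text = re.sub(r"(?<!\w)'(?=\w)", "`", text)
    # Remaining single quotes are already valid closing quotes/apostrophes.
    return text
```

Nested and triple-quote cases still need cleaning up by hand, as noted above.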

dmckee 2009-11-08 19:49:00

Answer 3

+2 A:

As Heinzi said, the following need attention:

\ { } $ ^ _ % ~ # &

Most can be escaped with a backslash, but \ becomes \textbackslash, ~ becomes \textasciitilde and ^ becomes \textasciicircum (\~ and \^ are accent commands, and \\ is a line break).
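As a minimal sketch of such an escaping step in Python (the names LATEX_ESCAPES and escape_latex are made up for this example): the replacement happens in a single pass, so the backslashes and braces introduced by one substitution are not escaped again by another. The ^ character is mapped to \textasciicircum because \^ is an accent command, and the trailing {} keeps the command from swallowing a following space.

```python
# Special character -> LaTeX replacement.
LATEX_ESCAPES = {
    "\\": r"\textbackslash{}",
    "~": r"\textasciitilde{}",
    "^": r"\textasciicircum{}",
    "{": r"\{",
    "}": r"\}",
    "$": r"\$",
    "%": r"\%",
    "#": r"\#",
    "&": r"\&",
    "_": r"\_",
}

# str.translate walks the input once, so output is never re-escaped.
_ESCAPE_TABLE = str.maketrans(LATEX_ESCAPES)

def escape_latex(text):
    """Escape the ten LaTeX special characters in plain text."""
    return text.translate(_ESCAPE_TABLE)
```

For example, escape_latex("50% of $x_1") gives 50\% of \$x\_1.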

I think you might want to leave line breaks alone. LaTeX handles these in exactly the same way as many content management systems; many people have come to expect that "double line break" = "paragraph break". Heck, even stackoverflow itself works that way.

(You cannot write malicious LaTeX code; everything that happens inside LaTeX stays inside LaTeX, unless you explicitly enable \write18 when running latex, and that is disabled by default.)

Thomas 2009-11-08 19:53:44

Infinite loops that eat up your whole CPU are also malicious, and as TeX is Turing-complete, these are definitely possible. Please run user-supplied code with a CPU limit.

pavpanchekha 2010-02-13 02:05:50
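One simple way to act on this from Python is a wall-clock timeout on the child process. Note that this is a simpler stand-in for a true CPU limit, not the same thing. The helper names latex_command and run_with_timeout are hypothetical, and the sketch assumes a pdflatex binary from a standard TeX distribution is on the PATH.

```python
import subprocess

def latex_command(tex_file):
    """Build a pdflatex call that never prompts and keeps shell escape off."""
    return [
        "pdflatex",
        "-no-shell-escape",           # keep \write18 disabled explicitly
        "-interaction=nonstopmode",   # never wait for user input
        "-halt-on-error",             # stop at the first error
        tex_file,
    ]

def run_with_timeout(cmd, seconds):
    """Run cmd; return True on exit code 0, False on failure or timeout."""
    try:
        # subprocess.run kills the child when the timeout expires.
        result = subprocess.run(cmd, capture_output=True, timeout=seconds)
    except subprocess.TimeoutExpired:
        return False
    return result.returncode == 0
```

A call such as run_with_timeout(latex_command("user.tex"), 30) would then give up on a document that loops forever instead of blocking the server.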

I think (but am not sure) that TeX will run out of stack space.

Thomas 2010-02-13 17:15:50

Answer 4

+1 A:

Another possible solution is to make all "special" characters into ordinary ones before inserting the user's text. That might avoid many headaches, but might also create new ones...

You can do this by changing the catcode of the character. The TeX Wikibook knows more.

\catcode`\$=12

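will turn $ into an ordinary character. However, some characters don't come out as you'd expect: \ becomes a double open quote, { becomes a dash... This is most likely because the default OT1-encoded fonts hold different glyphs in those slots, so loading \usepackage[T1]{fontenc} should help. Redefining } inside a group ({...}) makes TeX choke entirely, since the group can then no longer be closed.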

Long story short: only recommended if you know what you're doing.

Thomas 2009-11-08 20:07:43
