I'd like to take some user input text and quickly parse it to produce some LaTeX code. At the moment, I'm replacing % with \% and \n with \n\n, but I'm wondering whether there are other replacements I should be making when converting plain text to LaTeX.

I'm not super worried about safety here (can you even write malicious LaTeX code?), as this should only be used by the user to convert their own text into LaTeX, so they should probably be allowed to use their own LaTeX markup in the pre-converted text. Still, I'd like to make sure the output doesn't include accidental LaTeX commands if possible. If there's a good library to make such a conversion, I'd take a look.

+6  A: 

Apparently, the following characters

\ { } $ ^ _ % ~ # &

are special in LaTeX, so you should make sure to escape them (prefixing with backslash will do for some of them, see Thomas' answer for special cases) or tell your users not to use them unless they deliberately want to use LaTeX commands (or a mix of both, depending on the character).

Some additional pitfalls:

  • Not every line break in the text might be intended as a new paragraph.
  • If your users write in a language that needs non-ASCII characters, you will need to \usepackage something that deals with the encoding (e.g. inputenc with the utf8 option) or convert the characters yourself (e.g. ä -> \"a).
  • As dmckee points out, quotes also need to be treated separately.
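
For the "convert the characters yourself" route, here is a minimal sketch in Python (the question doesn't name a language, so Python is my assumption). It only knows five common accents, and the names ACCENTS and accents_to_latex are made up for illustration:

```python
import unicodedata

# Combining mark -> LaTeX accent command (a small, illustrative subset).
ACCENTS = {
    "\u0308": '\\"',  # diaeresis, as in ä
    "\u0301": "\\'",  # acute
    "\u0300": "\\`",  # grave
    "\u0302": "\\^",  # circumflex
    "\u0303": "\\~",  # tilde
}

def accents_to_latex(text):
    """Rewrite accented letters as LaTeX accent commands, e.g. ä -> \\"{a}."""
    # Compose first so that "a" + combining mark is seen as one character.
    text = unicodedata.normalize("NFC", text)
    out = []
    for ch in text:
        decomposed = unicodedata.normalize("NFD", ch)
        if len(decomposed) == 2 and decomposed[1] in ACCENTS:
            base, mark = decomposed
            out.append(f"{ACCENTS[mark]}{{{base}}}")
        else:
            # Anything not covered by the table is passed through unchanged.
            out.append(ch)
    return "".join(out)
```

Run this after any escaping of special characters, otherwise the backslashes it produces would be escaped too. Letters outside the table (ß, ø, non-Latin scripts) still need an encoding package.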

EDIT: Since this has become the accepted answer, I also added the points raised in the other answers, so this is now a summary.

Heinzi
Thanks, didn't think of the unicode stuff!
Noah
+2  A: 

Heinzi has already shown most of the basic characters that need to be escaped, but the hard part here is ensuring that the quoting comes out right.

She said "He didn't do it".

needs to be converted to

She said ``He didn't do it''.

which looks easy in this trivial case, but is full of gotchas that require careful handling. For modest-size texts, I generally use a naive substitution generated in sed and diddle the results by hand. Things are both easier and harder if your "plain text" uses curly quotes.


Here "naive quote substitution" means that quotes followed by word characters are replaced by (one or two as appropriate) back ticks, and all others are replaced by (one or two) single-quotes ('). That catches most cases in prose, but you will have to clean up all the triple-quote cases by hand.
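
A rough sketch of that naive substitution in Python rather than sed (the function name is made up for illustration). It adds one refinement that is not part of the rule as stated: a single quote inside a word is left alone, so the apostrophe in "didn't" survives.

```python
import re

def naive_latex_quotes(text):
    """Naive plain-text to LaTeX quote conversion; expect to fix edge cases by hand."""
    # Single quote before a word character, but not inside a word:
    # treat it as an opening quote. Every other single quote is left as-is,
    # which is already the LaTeX closing quote / apostrophe.
    text = re.sub(r"(?<!\w)'(?=\w)", "`", text)
    # Double quote directly before a word character: opening quote.
    text = re.sub(r'"(?=\w)', "``", text)
    # Every remaining double quote is treated as a closing quote.
    return text.replace('"', "''")
```

Nested cases such as "'like this'" still come out wrong, which is the hand cleanup mentioned above.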

dmckee
+2  A: 

As Heinzi said, the following need attention:

\ { } $ ^ _ % ~ # &

Most can be escaped with a backslash, but \ becomes \textbackslash, ~ becomes \textasciitilde and ^ becomes \textasciicircum (\~ and \^ on their own are accent commands, and \\ is a line break).
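
A minimal sketch of that mapping in Python (my choice of language; the names LATEX_SPECIALS and escape_latex are made up). The ^ entry is my addition, since \^ on its own is an accent command, and the trailing {} keeps the \text... commands from swallowing a following space.

```python
import re

# Replacement for each special character in the list above.
LATEX_SPECIALS = {
    "\\": r"\textbackslash{}",
    "{": r"\{",
    "}": r"\}",
    "$": r"\$",
    "^": r"\textasciicircum{}",  # assumption: \^ alone would be an accent
    "_": r"\_",
    "%": r"\%",
    "~": r"\textasciitilde{}",
    "#": r"\#",
    "&": r"\&",
}

# One alternation over all specials, so the text is scanned only once.
_SPECIALS_RE = re.compile("|".join(re.escape(c) for c in LATEX_SPECIALS))

def escape_latex(text):
    """Escape all LaTeX special characters in a single pass."""
    return _SPECIALS_RE.sub(lambda m: LATEX_SPECIALS[m.group(0)], text)
```

The single pass matters: chaining str.replace calls would re-escape the backslashes and braces introduced by earlier replacements.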

I think you might want to leave line breaks alone. LaTeX handles these in exactly the same way as many content management systems; many people have come to expect that "double line break" = "paragraph break". Heck, even Stack Overflow itself works that way.

(You cannot write malicious LaTeX code; everything that happens inside LaTeX stays inside LaTeX, unless you explicitly enable write18 when running latex, and that is disabled by default.)

Thomas
Infinite loops that eat up your whole CPU are also malicious, and as TeX is Turing-complete, these are definitely possible. Please run user-supplied code with a CPU limit.
pavpanchekha
I think (but am not sure) that TeX will run out of stack space.
Thomas
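
One way to act on the CPU-limit comment, sketched in Python. Note that this enforces a wall-clock timeout rather than a true CPU limit, and the function name and the file name user.tex are placeholders:

```python
import subprocess
from typing import Sequence

def run_with_time_limit(cmd: Sequence[str], seconds: float) -> bool:
    """Run cmd; return True only if it exits with status 0 within the limit."""
    try:
        result = subprocess.run(
            list(cmd),
            stdin=subprocess.DEVNULL,   # never wait for interactive input
            stdout=subprocess.DEVNULL,
            stderr=subprocess.DEVNULL,
            timeout=seconds,            # child is killed when this expires
        )
    except subprocess.TimeoutExpired:
        return False
    return result.returncode == 0

# Hypothetical call; assumes pdflatex is installed and user.tex exists:
# run_with_time_limit(
#     ["pdflatex", "-no-shell-escape", "-interaction=nonstopmode",
#      "-halt-on-error", "user.tex"],
#     30,
# )
```
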
+1  A: 

Another possible solution is to make all "special" characters into ordinary ones before inserting the user's text. That might avoid many headaches, but might also create new ones...

You can do this by changing the catcode of the character. The TeX Wikibook knows more.

\catcode`\$=12

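will turn $ into an ordinary character. However, some characters don't come out as you'd expect: \ becomes a double open quote, { becomes a dash... That happens because the default OT1 font encoding has other glyphs in those slots; loading fontenc with the T1 option should give the expected characters. Also, redefining } inside a group ({...}) makes TeX choke entirely, since the group can no longer be closed.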

Long story short: only recommended if you know what you're doing.

Thomas