ansaurus

Question

Replace all "\" characters which are *not* inside "<code>" tags

Answer 1

+3 A:

"I reckon I could solve this using negative LookBehinds and/or LookAheads."

You reckon wrong. Regular expressions are not a replacement for a parser.

I would suggest that you pipe the HTML through htmltidy, then read it with a DOM parser, and then transform the DOM to your target output format. Is there anything preventing you from taking this route?

troelskn 2009-11-23 15:31:56

+1: Couldn't agree more. The pain and suffering the OP will go through mangling that regex into working will still leave dozens of corner cases.

Jed Smith 2009-11-23 15:46:37

This is *exactly* what I have in mind! I am very well aware that the current solution is not "beautiful". But it works. The little bug with the code blocks is not a show-stopper. So, given the resource constraints, I'd rather fiddle with one line of regex than restart from scratch... As I already said in the Q. ;)

@JedSmith: Agreed. Which is why the parser is already on the future roadmap.

exhuma 2009-11-23 15:59:20

Answer 2

+2 A:

Parser FTW, ok. But if you can't use a parser, and you can be certain that <code> tags are never nested, you could try the following:

1. Find <code>.*?</code> sections of your file (you will probably need to turn on dot-matches-newlines mode).
2. Replace all backslashes inside that section with something unique like #?#?#?#
3. Replace the section found in 1 with that new section
4. Replace all backslashes with $\backslash$
5. Replace all <code> with \begin{verbatim} and all </code> with \end{verbatim}
6. Replace #?#?#?# with \

FYI, regexes in PHP don't support variable-length lookbehind. So that makes this conditional matching between two boundaries difficult.
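The steps above could be sketched in PHP roughly like this. The function name escapeOutsideCode is made up for illustration, and the sketch assumes that <code> tags are never nested and that the placeholder #?#?#?# never occurs in the input.

```php
<?php
// Hypothetical helper; the placeholder token is the one suggested in the answer.
function escapeOutsideCode(string $html): string
{
    $placeholder = '#?#?#?#'; // assumed never to occur in the input

    // Steps 1-3: protect the backslashes inside each <code>...</code> section.
    // The "s" modifier makes "." match newlines as well.
    $html = preg_replace_callback(
        '~<code>.*?</code>~s',
        function (array $m) use ($placeholder): string {
            return str_replace('\\', $placeholder, $m[0]);
        },
        $html
    );

    // Step 4: escape every backslash that is still left (all outside <code>).
    $html = str_replace('\\', '$\\backslash$', $html);

    // Step 5: turn the code tags into a verbatim environment.
    $html = str_replace(
        ['<code>', '</code>'],
        ['\\begin{verbatim}', '\\end{verbatim}'],
        $html
    );

    // Step 6: restore the protected backslashes.
    return str_replace($placeholder, '\\', $html);
}

echo escapeOutsideCode('a \\ b <code>C:\\x</code>'), "\n";
// prints: a $\backslash$ b \begin{verbatim}C:\x\end{verbatim}
```

Using preg_replace_callback folds steps 1 to 3 into one call, so no lookbehind is needed at all.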

Tim Pietzcker 2009-11-23 15:46:44

Thanks. Good idea!

exhuma 2009-11-23 16:00:46

... and thanks for actually reading, and taking into account the part that I am already thinking about a real parser *hug* :)

exhuma 2009-11-23 16:01:45

Answer 3

A:

Provided that your <code> blocks are not nested, this regex would find a backslash after ^ start-of-string or </code> with no <code> in between.

((?:^|</code>)(?:(?!<code>).)+?)\\
    |            |              |
    |            |              \-- backslash
    |            \-- least amount of anything not followed by <code>
    \-- start-of-string or </code>

And replace it with:

$1$\backslash$

You'd have to run this regex in "singleline" mode, so that . matches newlines. You'd also have to run it multiple times; specifying global replacement is not enough, because each pass will only replace the first eligible backslash after start-of-string or </code>.
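A minimal PHP sketch of the "run it multiple times" part, under a few assumptions: the function name is hypothetical, and instead of inserting $\backslash$ directly, each pass inserts a backslash-free marker that is swapped for $\backslash$ at the end. As far as I can tell this detour is needed, because the replacement text itself contains a backslash and a later pass would match it again. Note also that the pattern wants at least one character in front of the backslash, so a backslash at the very start of the string or directly after </code> is left alone.

```php
<?php
// Hypothetical helper; the pattern is the one from the answer above.
function escapeByRepeatedRegex(string $text): string
{
    // "s" is PCRE's dot-matches-newline ("singleline") modifier.
    $pattern = '~((?:^|</code>)(?:(?!<code>).)+?)\\\\~s';

    // Assumed marker: contains no backslash and never occurs in the input.
    $marker = '#?#?#?#';

    // One global pass only handles the first backslash per segment,
    // so repeat until a pass makes no replacement at all.
    do {
        $text = preg_replace($pattern, '${1}' . $marker, $text, -1, $count);
    } while ($count > 0);

    // Finally swap the marker for the LaTeX escape.
    return str_replace($marker, '$\\backslash$', $text);
}

echo escapeByRepeatedRegex('a \\ b \\ c <code>C:\\x</code> d \\ e'), "\n";
// prints: a $\backslash$ b $\backslash$ c <code>C:\x</code> d $\backslash$ e
```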

Andomar 2009-11-23 15:55:36

Answer 4

A:

Write a parser based on an HTML or XML parser like DOMDocument. Traverse the parsed DOM and replace each \ in every text node that is not a descendant of a code node with $\backslash$, and replace every code node with \begin{verbatim} … \end{verbatim}.
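A rough sketch of that traversal with DOMDocument and DOMXPath. The function name is invented for illustration, the input is assumed to be an ASCII HTML fragment, and for brevity the sketch returns plain text only, so other tags such as <br> are simply dropped rather than converted.

```php
<?php
// Hypothetical helper, not a complete HTML-to-LaTeX converter.
function escapeViaDom(string $html): string
{
    libxml_use_internal_errors(true); // tolerate sloppy markup without warnings
    $doc = new DOMDocument();
    $doc->loadHTML('<div>' . $html . '</div>');
    libxml_clear_errors();

    $xpath = new DOMXPath($doc);

    // Text nodes that are not inside a <code> element: escape backslashes.
    foreach ($xpath->query('//text()[not(ancestor::code)]') as $text) {
        $text->nodeValue = str_replace('\\', '$\\backslash$', $text->nodeValue);
    }

    // <code> elements: swap each one for a verbatim environment holding its text.
    // iterator_to_array() takes a snapshot first, so replacing nodes is safe.
    foreach (iterator_to_array($xpath->query('//code')) as $code) {
        $latex = '\\begin{verbatim}' . $code->textContent . '\\end{verbatim}';
        $code->parentNode->replaceChild($doc->createTextNode($latex), $code);
    }

    // Plain text of the whole document; remaining tags are not rendered.
    return $doc->documentElement->textContent;
}

echo escapeViaDom('The Hello \\ World<br><code>C:\\documents\\hello_world.txt</code>'), "\n";
```

The text nodes are handled before the code nodes on purpose; the other way round, the backslashes of the freshly created verbatim text would be escaped too.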

Gumbo 2009-11-23 15:57:12

Not answering the question... I now added a quick para explaining why I did not use a parser in the beginning.

exhuma 2009-11-23 16:15:50

Answer 5

+6 A:

If it were me, I would look for an HTML parser and use that.

Another option is to split the string into <code>.*?</code> sections and the parts in between, update only the parts in between, and then recombine everything.

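// Sample input: one backslash in the prose, two inside a <code> section.
$x="The Hello \ World document is located in:\n<br>
<code>C:\documents\hello_world.txt</code>";

// Split on <code>...</code>. PREG_SPLIT_DELIM_CAPTURE keeps the code sections
// in the result, so even indexes hold prose and odd indexes hold code.
$r=preg_split("/(<code>.*?<\/code>)/", $x,-1,PREG_SPLIT_DELIM_CAPTURE);

// Escape backslashes in the prose parts only.
for($i=0;$i<count($r);$i+=2)
    $r[$i]=str_replace("\\","$\\backslash$",$r[$i]);

// Glue the prose and code sections back together.
$x=implode($r);

echo $x;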

Here is the result (as displayed by a browser, so the <br> and <code> tags themselves are not visible):

The Hello $\backslash$ World document is located in: 
C:\documents\hello_world.txt
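A variant of the same split, as a sketch only: it is wrapped in a hypothetical function, adds the "s" modifier so that a code section may span several lines, and also converts the code sections to a verbatim environment on the odd indexes. The sample call uses an input that starts with a code tag.

```php
<?php
// Hypothetical helper built on the preg_split approach.
function splitAndConvert(string $input): string
{
    // "s" lets a code section span several lines (not in the original pattern).
    $parts = preg_split('~(<code>.*?</code>)~s', $input, -1, PREG_SPLIT_DELIM_CAPTURE);

    foreach ($parts as $i => $part) {
        if ($i % 2 === 0) {
            // Even index: text outside <code>, escape its backslashes.
            $parts[$i] = str_replace('\\', '$\\backslash$', $part);
        } else {
            // Odd index: a whole <code>...</code> section, convert the tags only.
            $parts[$i] = str_replace(
                ['<code>', '</code>'],
                ['\\begin{verbatim}', '\\end{verbatim}'],
                $part
            );
        }
    }

    return implode('', $parts);
}

// Input that starts with a code section: $parts[0] is then an empty string.
echo splitAndConvert('<code>C:\\a</code> and \\ outside'), "\n";
// prints: \begin{verbatim}C:\a\end{verbatim} and $\backslash$ outside
```

Because preg_split is called without PREG_SPLIT_NO_EMPTY, the empty leading piece is kept and the even/odd pattern still holds.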

Sorry if my approach is not suitable for you.

S.Mark 2009-11-23 15:59:55

+1 Great, and way more helpful than the usual "write your own parser" :)

Andomar 2009-11-23 16:04:39

I see only one problem, will this also work if the input starts with a code tag? You should probably check for this. BTW, when imploding the array, the code tags won't be put back with this solution. You could reuse the "for" loop for this purpose.

dutchflyboy 2009-11-23 16:12:40

It is. Thanks! Splitting the string like this seems actually quite suitable!

exhuma 2009-11-23 16:14:08

@dutchflyboy: yup. Thought about the same.

exhuma 2009-11-23 16:18:10

S.Mark 2009-11-24 02:15:39

Answer 6

+1 A:

Pandoc? Pandoc converts between a bunch of formats. You can also concatenate a bunch of files together and then convert them. Maybe a few shell scripts combined with your PHP scraping scripts?

With your "expected input" and the command pandoc -o text.tex test.html the output is:

The Hello \textbackslash{} World document is located in:
\verb!C:\documents\hello_world.txt!

pandoc can read from stdin, write to stdout or pipe right into a file.

Mica 2009-11-23 17:05:23

Very interesting. I'll have a closer look at that tomorrow.

exhuma 2009-11-23 22:37:26

I've had better results with pandoc than with any other converter or what have you. It installs on Windows, Mac, or Linux. The more "vanilla" your LaTeX code, the better it works. For example, I like to make heavy use of the LaTeX package wrapfig when I can, but pandoc doesn't play very nicely with it. But for your WYSIWYG editors, it should go well. Don't forget about pandoc's HTML to markdown converter (then you could go markdown to LaTeX if necessary).

Mica 2009-11-23 23:23:59
