I'm in need of a tricky regex and I don't know if it can be written.

I'm trying to clean up some horrid HTML output from MS Word. Here's an example of the dandy job it does on an ordered (or numbered) list.

<p>1.&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;
Proin Facilisi Habitasse Hymenaeos Ligula Litora Luctus Mi </p>

<p>2.&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;
Nulla Auctor Bibendum Suspendisse Commodo Cras Cursus Anno </p>

<p>3.&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;
Ac Nec Netus Penatibus Purus Cras Mollis </p>

Beautiful, isn't it? Paragraph tags and nonbreaking spaces...

I'm wondering if it's even feasible to write a regex to replace this with the following:

<ol>
<li>1.&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;
Proin Facilisi Habitasse Hymenaeos Ligula Litora Luctus Mi </li>

<li>2.&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;
Nulla Auctor Bibendum Suspendisse Commodo Cras Cursus Anno </li>

<li>3.&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;
Ac Nec Netus Penatibus Purus Cras Mollis </li>
</ol>

The difficulty is that the number of &nbsp;s can vary from none to just a few to a lot, and a list can be of any length. Having no &nbsp;s seems to be rare, and it seems to happen only after a list gets larger (say, when going from 9 to 10 or 99 to 100).

Anyway, if such a thing is possible, that would be awesome. As it stands, I can search for long strings of &nbsp;s and then manually apply list formatting, but it's not as fast as automatic.

A: 

All those &nbsp;s have no effect. What you need is this:

/<p>( *[0-9]+.*?)</p>/<li>\1</li>/
NawaMan
You need to put some \'s before some /'s I think.
Kinopiko
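For illustration, here is a minimal Python sketch of the substitution suggested above. It is not from the original answer: the variable names and the shortened sample text are assumptions. Python's re module takes the pattern as a plain string, so the slashes in </p> need no escaping there, but the DOTALL flag has to be added because each paragraph in the Word output spans a line break.

```python
import re

# Sample input modeled on the question (shortened); names are illustrative.
word_html = (
    "<p>1.&nbsp;&nbsp;&nbsp;\nProin Facilisi </p>\n\n"
    "<p>2.&nbsp;&nbsp;&nbsp;\nNulla Auctor </p>\n"
)

# No /.../ delimiters in Python, so '</p>' is written as-is.
# re.DOTALL lets '.' cross the newline inside each paragraph.
numbered_paragraph = re.compile(r"<p>( *[0-9]+.*?)</p>", re.DOTALL)

# \1 re-inserts the captured number and text between the new tags.
as_list_items = numbered_paragraph.sub(r"<li>\1</li>", word_html)
print(as_list_items)
```

Note that this keeps the numbers inside the items and does not add the surrounding <ol> and </ol> tags.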
+1  A: 

First: all the standard replies apply to this question: you (should|can|may) not parse/process HTML (valid or not) using regex. For a wide range of reasons not to do this, I recommend searching the web and/or SO.

That said (and assuming your paragraph tags cannot be nested!), you cannot do this in one replacement. You will first have to wrap <ol> and </ol> tags around the paragraphs that "look like" ordered lists. I assume that a paragraph belongs to an ordered list when it starts with <p> NUMBER. (a paragraph tag, optional white space, a number and a full stop).

regex       : (?s)((?:<p>\s*\d+\.(?:(?!</p>).)*</p>\s*)+)
replacement : <ol>$1</ol>

A short explanation:

// regex
(?s)                # enable DOT-ALL matching
(                   # open group 1
  (?:               #   open non-capturing group 1
    <p>\s*\d+\.     #     match '<p>', zero or more white spaces, a number and a full stop
    (?:(?!</p>).)*  #     [when looking ahead, if there's no '</p>', only then match any character] zero or more times
    </p>            #     match '</p>'
    \s*             #     match zero or more white spaces
  )                 #   close non-capturing group 1
  +                 #   non-capturing group 1 one or more times
)                   # close group 1

// replacement
<ol>                # insert '<ol>'
$1                  # insert what is matched by the regex in group 1
</ol>               # insert '</ol>'

Now your string will contain:

<ol><p>1.&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;
Proin Facilisi Habitasse Hymenaeos Ligula Litora Luctus Mi </p>

<p>2.&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;
Nulla Auctor Bibendum Suspendisse Commodo Cras Cursus Anno </p>

<p>3.&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;
Ac Nec Netus Penatibus Purus Cras Mollis </p></ol>
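As a rough Python sketch of this first replacement (the function name, constant name and sample string are assumptions made for illustration; only the pattern comes from the answer):

```python
import re

# Pattern from the answer; (?s) is DOT-ALL, so '.' also matches newlines.
LIST_RUN = re.compile(r"(?s)((?:<p>\s*\d+\.(?:(?!</p>).)*</p>\s*)+)")


def wrap_numbered_runs(html: str) -> str:
    """Wrap each run of consecutive numbered <p> elements in <ol>...</ol>."""
    # Python spells the group reference \1 (or \g<1>) rather than $1.
    return LIST_RUN.sub(r"<ol>\1</ol>", html)


sample = "<p>Intro</p>\n<p>1.&nbsp; One </p>\n\n<p>2.&nbsp; Two </p>"
wrapped = wrap_numbered_runs(sample)
print(wrapped)
```

The unnumbered "Intro" paragraph is left alone, because the pattern only starts matching at a <p> that is followed by digits and a full stop.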

Next, replace the paragraph tags with <li> and </li> tags, dropping the leading numbers (the <ol> numbers the items itself):

regex       : (?s)<p>\s*\d+\.((?:(?!</p>).)*)</p>
replacement : <li>$1</li>

Again, a short explanation:

// regex
(?s)               # enable DOT-ALL matching
<p>                # match '<p>'
\s*                # match zero or more white space characters
\d+                # match one or more digits
\.                 # match a dot
(                  # start group 1
  (?:(?!</p>).)*   #   [when looking ahead, if there's no '</p>', only then match any character] zero or more times
)                  # end group 1
</p>               # match '</p>'

// replacement
<li>               # insert '<li>'
$1                 # insert what is matched by the regex in group 1
</li>              # insert '</li>'

Now your string will look like:

<ol><li>&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;
Proin Facilisi Habitasse Hymenaeos Ligula Litora Luctus Mi </li>

<li>&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;
Nulla Auctor Bibendum Suspendisse Commodo Cras Cursus Anno </li>

<li>&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;
Ac Nec Netus Penatibus Purus Cras Mollis </li></ol>
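Both replacements can be chained. The following is a minimal Python sketch under the same assumptions as the answer (paragraphs are not nested, and every numbered paragraph belongs to a list); the function and constant names are made up for illustration.

```python
import re

# Step 1: a run of consecutive numbered paragraphs (pattern from the answer).
LIST_RUN = re.compile(r"(?s)((?:<p>\s*\d+\.(?:(?!</p>).)*</p>\s*)+)")
# Step 2: a single numbered paragraph; group 1 is the text after the number.
LIST_ITEM = re.compile(r"(?s)<p>\s*\d+\.((?:(?!</p>).)*)</p>")


def word_paragraphs_to_ol(html: str) -> str:
    """Apply the two replacements in order: wrap runs, then convert items."""
    wrapped = LIST_RUN.sub(r"<ol>\1</ol>", html)
    return LIST_ITEM.sub(r"<li>\1</li>", wrapped)


sample = "<p>1.&nbsp;&nbsp;\nProin </p>\n\n<p>2.&nbsp;&nbsp;\nNulla </p>"
converted = word_paragraphs_to_ol(sample)
print(converted)
```

The order matters: if the items were converted first, the first pattern would no longer find any <p> tags to wrap.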

But again: be very very careful. When there's one little mistake in an opening or closing tag, you may very well end up with something that is far worse than what you've started with!

Bart Kiers
A: 

No, it is not feasible as a regular expression, because HTML is not a regular language.

Instead, take any HTML parser, find consecutive <p> nodes that share a common parent node and whose contents begin with a number, and put them as <li> nodes into a new <ol> node.
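One way this could look, sketched with Python's standard-library html.parser. This is an assumption-laden illustration rather than a finished tool: the class and function names are hypothetical, it rewrites the markup as a stream instead of building a node tree, and it silently drops comments and doctype declarations.

```python
import re
from html.parser import HTMLParser

NUMBER_PREFIX = re.compile(r"\s*\d+\.")


class NumberedParagraphRewriter(HTMLParser):
    """Streams HTML through, turning runs of numbered <p> into <ol><li>."""

    def __init__(self):
        super().__init__(convert_charrefs=False)  # keep &nbsp; as written
        self.out = []         # finished output fragments
        self.para = None      # fragments of the <p> being read, else None
        self.para_tag = ""    # original start tag text of that <p>
        self.in_list = False  # True while an <ol> is open in the output

    def _emit(self, text, breaks_list=True):
        if self.para is not None:
            self.para.append(text)      # inside a paragraph: just buffer
            return
        if breaks_list and self.in_list:
            self.out.append("</ol>")    # anything else ends the current run
            self.in_list = False
        self.out.append(text)

    def _end_paragraph(self):
        body, self.para = "".join(self.para), None
        number = NUMBER_PREFIX.match(body)
        if number:
            if not self.in_list:
                self.out.append("<ol>")
                self.in_list = True
            self.out.append("<li>" + body[number.end():] + "</li>")
        else:
            self._emit(self.para_tag + body + "</p>")

    def handle_starttag(self, tag, attrs):
        if tag == "p":
            if self.para is not None:   # unclosed <p>: close it implicitly
                self._end_paragraph()
            self.para, self.para_tag = [], self.get_starttag_text()
        else:
            self._emit(self.get_starttag_text())

    def handle_startendtag(self, tag, attrs):
        self._emit(self.get_starttag_text())

    def handle_endtag(self, tag):
        if tag == "p" and self.para is not None:
            self._end_paragraph()
        else:
            self._emit("</" + tag + ">")

    def handle_data(self, data):
        # White space between paragraphs does not interrupt a list.
        self._emit(data, breaks_list=bool(data.strip()))

    def handle_entityref(self, name):
        self._emit("&" + name + ";")

    def handle_charref(self, name):
        self._emit("&#" + name + ";")

    def result(self):
        self.close()
        if self.para is not None:
            self._end_paragraph()
        if self.in_list:
            self.out.append("</ol>")
            self.in_list = False
        return "".join(self.out)


def convert(html: str) -> str:
    parser = NumberedParagraphRewriter()
    parser.feed(html)
    return parser.result()


sample = ("<div><p>1.&nbsp;&nbsp;\nProin </p>\n\n<p>2.&nbsp;\nNulla </p>\n"
          "<p>Closing text</p></div>")
print(convert(sample))
```

Because any tag or non-blank text outside a paragraph closes the open list, numbered paragraphs are only grouped when they sit next to each other under the same parent.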

Svante
+1  A: 

Not quite what you're asking for, but the HTML output from Microsoft Word has long been regarded by many as very poor, and many people have found themselves trying to clean it up. As a result, there are a good number of HTML-cleaning tools out there, and a quick search on Google suggests that the HTML Tidy Library Project, or others, may help you out. Don't reinvent the wheel unless you have to!

Tim