ansaurus

Question

RegEx to convert Word output to an HTML ordered list

Answer 1

A:

All those &nbsp; entities have no effect. What you need is this:

/<p>( *[0-9]+.*?)</p>/<li>\1</li>/

NawaMan 2009-10-22 05:21:09

You need to put some \'s before some /'s I think.

Kinopiko 2009-10-22 06:02:34
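For comparison, here is a minimal sketch of the same substitution in Python, where the pattern is an ordinary string rather than a /.../-delimited expression, so the slash in </p> needs no escaping. The function name and sample input are made up for illustration. Note that this pattern keeps the number inside the <li>, and that '.' does not cross line breaks unless DOTALL is enabled.

```python
import re

def numbered_paragraphs_to_items(html: str) -> str:
    # Same pattern as above: '<p>', optional spaces, digits, then a lazy
    # match up to the nearest '</p>' on the same line.
    return re.sub(r"<p>( *[0-9]+.*?)</p>", r"<li>\1</li>", html)

# Hypothetical sample input
converted = numbered_paragraphs_to_items("<p> 1. First</p><p>Plain</p>")
print(converted)
```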

Answer 2

+1 A:

First: all the standard replies apply to this question: you (should|can|may) not parse/process html (valid or not) using regex. For a wide range of reasons not to do this, I recommend searching the web and/or SO.

That said (and assuming your paragraph tags cannot be nested!), you cannot do this in one replacement. You will first have to wrap <ol> and </ol> tags around the runs of paragraphs that "look like" ordered lists. I assume that a paragraph is an ordered-list item when it starts with <p> NUMBER. (a paragraph tag, optional white space, a number and a full stop).

regex       : (?s)((?:<p>\s*\d+\.(?:(?!</p>).)*</p>\s*)+)
replacement : <ol>$1</ol>

A short explanation:

// regex
(?s)                # enable DOT-ALL matching
(                   # open group 1
  (?:               #   open non-capturing group 1
    <p>\s*\d+\.     #     match '<p>', zero or more white space characters, a number and a full stop
    (?:(?!</p>).)*  #     [when looking ahead, if there's no '</p>', only then match any character] zero or more times
    </p>            #     match '</p>'
    \s*             #     match zero or more white space characters
  )                 #   close non-capturing group 1
  +                 #   repeat non-capturing group 1 one or more times
)                   # close group 1

// replacement
<ol>                # insert '<ol>'
$1                  # insert what is matched by the regex in group 1
</ol>               # insert '</ol>'

Now your string will contain:

<ol><p>1.&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;
Proin Facilisi Habitasse Hymenaeos Ligula Litora Luctus Mi </p>

<p>2.&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;
Nulla Auctor Bibendum Suspendisse Commodo Cras Cursus Anno </p>

<p>3.&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;
Ac Nec Netus Penatibus Purus Cras Mollis </p></ol>
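As a rough illustration, this first replacement could be written in Python as follows. This is only a sketch: the function name and the sample input are made up, and Python's re module writes the group reference as \1 where the replacement above uses $1.

```python
import re

# Pattern from the answer; the leading (?s) lets '.' match newlines too.
WRAP_RE = re.compile(r"(?s)((?:<p>\s*\d+\.(?:(?!</p>).)*</p>\s*)+)")

def wrap_numbered_paragraphs(html: str) -> str:
    """Wrap each run of consecutive numbered paragraphs in <ol>...</ol>."""
    return WRAP_RE.sub(r"<ol>\1</ol>", html)

# Hypothetical sample input
sample = "<p>Intro</p>\n<p>1.&nbsp;First </p>\n\n<p>2.&nbsp;Second </p>\n<p>Outro</p>"
wrapped = wrap_numbered_paragraphs(sample)
print(wrapped)
```

Because the group ends with \s*, any white space that follows the last numbered paragraph of a run ends up inside the <ol> as well.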

Next, replace all the paragraphs (including their numbers!) with <li> and </li> tags:

regex       : (?s)<p>\s*\d+\.((?:(?!</p>).)*)</p>
replacement : <li>$1</li>

Again, a short explanation:

// regex
(?s)               # enable DOT-ALL matching
<p>                # match '<p>'
\s*                # match zero or more white space characters
\d+                # match one or more digits
\.                 # match a dot
(                  # start group 1
  (?:(?!</p>).)*   #   [when looking ahead, if there's no '</p>', only then match any character] zero or more times
)                  # end group 1
</p>               # match '</p>'

// replacement
<li>               # insert '<li>'
$1                 # insert what is matched by the regex in group 1
</li>              # insert '</li>'

Now your string will look like:

<ol><li>&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;
Proin Facilisi Habitasse Hymenaeos Ligula Litora Luctus Mi </li>

<li>&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;
Nulla Auctor Bibendum Suspendisse Commodo Cras Cursus Anno </li>

<li>&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;
Ac Nec Netus Penatibus Purus Cras Mollis </li></ol>
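The second replacement can be sketched the same way in Python. The names and the input string are illustrative, and the input is assumed to have been wrapped in <ol> tags already; the pattern itself would also rewrite numbered paragraphs that sit outside any <ol>.

```python
import re

# Pattern from the answer: group 1 holds everything after the number and dot.
ITEM_RE = re.compile(r"(?s)<p>\s*\d+\.((?:(?!</p>).)*)</p>")

def paragraphs_to_items(html: str) -> str:
    """Turn each numbered <p> into an <li>, dropping the number itself."""
    return ITEM_RE.sub(r"<li>\1</li>", html)

# Hypothetical input that has already been wrapped in <ol> tags
wrapped = "<ol><p>1.&nbsp;First </p>\n\n<p>2.&nbsp;Second </p></ol>"
items = paragraphs_to_items(wrapped)
print(items)
```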

But again: be very, very careful. If there is one small mistake in an opening or closing tag, you may well end up with something far worse than what you started with!

Bart Kiers 2009-10-22 07:40:47

Answer 3

A:

No, it is not feasible as a regular expression, because HTML is not a regular language.

Instead, take any HTML parser, find subsequent <p> nodes that are inside a common parent node and the contents of which begin with ordered numerals, and put them as <li> nodes into a new <ol> node.
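A minimal sketch of that idea, using Python's standard html.parser module as the tokenizer. It is a streaming rewrite rather than a real tree transformation, and the class and function names are invented for this example. It assumes <p> elements are not nested, that the number is the first text inside the paragraph (not wrapped in a <span>, say), and it does not preserve doctype declarations or processing instructions.

```python
import re
from html.parser import HTMLParser

NUMBER_PREFIX = re.compile(r"\s*\d+\.")

class ListBuilder(HTMLParser):
    """Re-emit HTML, turning runs of numbered <p> siblings into one <ol>."""

    def __init__(self):
        super().__init__(convert_charrefs=False)  # keep &nbsp; etc. verbatim
        self.out = []         # finished output fragments
        self.items = []       # bodies of pending numbered paragraphs
        self.para = None      # fragments of the <p> being read, or None
        self.para_start = ""  # original text of that <p> start tag
        self.gap = ""         # white space seen after the last numbered <p>

    def _flush(self):
        # Close the pending list, if any, then restore trailing white space.
        if self.items:
            lis = "".join(f"<li>{body}</li>" for body in self.items)
            self.out.append(f"<ol>{lis}</ol>")
        self.out.append(self.gap)
        self.items, self.gap = [], ""

    def _emit(self, text):
        if self.para is not None:
            self.para.append(text)        # inside a paragraph: buffer it
        elif self.items and not text.strip():
            self.gap += text              # white space between list paragraphs
        else:
            self._flush()                 # anything else ends the current run
            self.out.append(text)

    def handle_starttag(self, tag, attrs):
        if tag == "p" and self.para is None:
            self.para, self.para_start = [], self.get_starttag_text()
        else:
            self._emit(self.get_starttag_text())

    def handle_startendtag(self, tag, attrs):
        self._emit(self.get_starttag_text())

    def handle_endtag(self, tag):
        if tag == "p" and self.para is not None:
            body, self.para = "".join(self.para), None
            match = NUMBER_PREFIX.match(body)
            if match:
                self.items.append(body[match.end():])
                self.gap = ""
            else:
                self._flush()
                self.out.append(f"{self.para_start}{body}</p>")
        else:
            self._emit(f"</{tag}>")

    def handle_data(self, data):
        self._emit(data)

    def handle_entityref(self, name):
        self._emit(f"&{name};")

    def handle_charref(self, name):
        self._emit(f"&#{name};")

    def handle_comment(self, data):
        self._emit(f"<!--{data}-->")

def convert(html: str) -> str:
    builder = ListBuilder()
    builder.feed(html)
    builder.close()
    builder._flush()
    if builder.para is not None:  # unclosed <p> at end of input: keep as-is
        builder.out.append(builder.para_start + "".join(builder.para))
    return "".join(builder.out)

# Hypothetical sample input
result = convert("<div><p>Intro</p>\n<p>1.&nbsp;First </p>\n\n"
                 "<p>2.&nbsp;Second </p>\n<p>Outro</p></div>")
print(result)
```

Because any tag other than a numbered <p> ends the current run, paragraphs that belong to different parent nodes are never merged into the same list.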

Svante 2009-10-22 08:22:52

Answer 4

+1 A:

Not quite what you're asking for, but the HTML output from Microsoft Word has long been regarded by many as very poor, and many people have found themselves trying to clean it up. As a result, there are a good number of HTML-cleaning tools out there, and a quick search on Google suggests that the HTML Tidy Library Project, or others, may help you out. Don't reinvent the wheel unless you have to!

Tim 2009-10-22 08:37:24
