I'm in need of a tricky regex, and I don't know if it can even be written.
I'm trying to clean up some horrid HTML output from MS Word. Here's an example of the dandy job it does on an ordered (or numbered) list:
<p>
1.
Proin Facilisi Habitasse Hymenaeos Ligula Litora Luctus Mi</p>
<p>
2.
Nulla Auctor Bibendum Suspendisse Commodo Cras Cursus Anno</p>
<p>
3.
Ac Nec Netus Penatibus Purus Cras Mollis</p>
Beautiful, isn't it? Paragraph tags and nonbreaking spaces...
I'm wondering if it's even feasible to write a regex to replace this with the following:
<ol>
<li>
1.
Proin Facilisi Habitasse Hymenaeos Ligula Litora Luctus Mi</li>
<li>
2.
Nulla Auctor Bibendum Suspendisse Commodo Cras Cursus Anno</li>
<li>
3.
Ac Nec Netus Penatibus Purus Cras Mollis</li>
</ol>
The difficulty is that the number of &nbsp; entities after each number can vary from none to just a few to a lot, and a list can be of any length. Having no &nbsp; entities at all seems to be rare, and it seems to happen only once a list gets larger (say, when going from 9 to 10 or from 99 to 100).
Anyway, if such a thing is possible, that would be awesome. As it stands, I can search for long strings of &nbsp; entities and then apply list formatting by hand, but that's not as fast as doing it automatically.
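
For illustration, here is one way the replacement could be sketched in Python. It's only a sketch under several assumptions that aren't stated above: each list item is a single `<p>` element (possibly with attributes), the item starts with digits followed by a period, the padding is any mix of `&nbsp;` and whitespace, and a list is a run of such paragraphs separated only by whitespace. The function name `paragraphs_to_ol` is made up for this example. It also drops the literal "1.", "2.", ... since an `<ol>` renders its own numbers, which may not be what's wanted.

```python
import re

# One numbered paragraph: <p ...>, digits and a period, any amount of
# &nbsp; / whitespace padding (possibly none), then the text up to </p>.
# ASSUMPTION: items never contain a nested </p>.
ITEM = r"<p\b[^>]*>\s*\d+\.(?:&nbsp;|\s)*(.*?)\s*</p>"
ITEM_RE = re.compile(ITEM, re.IGNORECASE | re.DOTALL)

# A list: one numbered paragraph followed by any number of others,
# separated only by whitespace.
RUN_RE = re.compile(ITEM + r"(?:\s*" + ITEM + r")*", re.IGNORECASE | re.DOTALL)


def paragraphs_to_ol(html):
    """Wrap each run of numbered <p> paragraphs in a single <ol>."""

    def wrap(run):
        # Pull the item text out of every paragraph in this run.
        items = ITEM_RE.findall(run.group(0))
        lis = "\n".join("<li>" + text + "</li>" for text in items)
        return "<ol>\n" + lis + "\n</ol>"

    return RUN_RE.sub(wrap, html)


if __name__ == "__main__":
    sample = (
        "<p>\n1.&nbsp;&nbsp;&nbsp;\nProin Facilisi</p>\n"
        "<p>\n2.\nNulla Auctor</p>\n"
        "<p>Plain paragraph</p>"
    )
    print(paragraphs_to_ol(sample))
```

Doing it in two steps (find a run, then rewrite the paragraphs inside it) avoids needing a single regex that both converts items and knows where a list begins and ends. Note that any ordinary paragraph that happens to start with a number and a period (a year, say) would also be turned into a list, so the result is worth a quick look afterwards.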