I've been banging my head against the wall on this one all day and am getting close to my wits' end. Looking for a fresh perspective.

Sample Input Text:
(line breaks added for clarity; they are not in the actual data)

</div>#My Novel<br />  
##Chapter1<br />  
It was a dark and stormy night<br />
##Chapter 2<br />
The End

Desired Output

</div><h1>My Novel</h1><br />
<h1>Chapter1</h1><br />  
It was a dark and stormy night<br />  
<h1>Chapter 2</h1><br />  
The End

Actual Output

</div><h1>My Novel</h1><br />
##Chapter1<br />  
It was a dark and stormy night<br />  
<h1>Chapter 2</h1><br />  
The End

Here is the match expression
(formatted for easy reading; the comments and line breaks are not in the actual expression)

(?<preamble>
    (                             
     ([<]\/\w+\d*[>])|([<]\w+\d*\s*\/[>])   #</tag> or <tag />
    )
    \s*  #optional whitespace              
)

(?<hashmarks>
    \#{1,6}      #1-6 hash marks
)    

(?<content>
    .+?          #header content
 )   

(?<closing>
    ([<](br|\/\s*br|br\s*\/)[>])   #<br>,</br>, or <br />
)

Here is the replacement expression

${preamble}<h1>${content}</h1>${closing}

If it matters, I am using the following C# Regex.Replace overload:

Regex.Replace(Source,SrchExp,ReplExpr,RegexOptions.IgnoreCase)

The question (finally)
Can anyone see why it is replacing #My Novel and ##Chapter 2, but not ##Chapter1?

Sorry for the long post, and hopefully I didn't munge anything while trying to format it to make it readable for SO.

Update:

One more thing that might help. Adding an extra break tag right after "Novel" makes the provided code start working perfectly. No idea why yet.

Sample Input Text (modified):

</div>#My Novel<br /><br />
##Chapter1<br />  
It was a dark and stormy night<br />
##Chapter 2<br />
The End
+2  A: 

Here's one that was actually tested and appears to work.

The issue is that matches cannot overlap: once a match is found, the search resumes exactly where that match ended. The closing <br /> of #My Novel has already been consumed by the first match, so it is not available to serve as the preamble for ##Chapter1, and ##Chapter1 is missed.
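
To make the non-overlapping behaviour visible, here is a minimal sketch that lists what each match of the question's original expression consumes. The class and method names (OverlapDemo, ConsumedSpans) are made up for illustration, and the input is the sample text joined onto one line with no whitespace between tags, which is an assumption about the real data.

```csharp
using System;
using System.Linq;
using System.Text.RegularExpressions;

public static class OverlapDemo
{
    // The question's match expression, collapsed onto one line (no comments or line breaks).
    public const string Pattern =
        @"(?<preamble>(([<]\/\w+\d*[>])|([<]\w+\d*\s*\/[>]))\s*)" +
        @"(?<hashmarks>\#{1,6})" +
        @"(?<content>.+?)" +
        @"(?<closing>([<](br|\/\s*br|br\s*\/)[>]))";

    // Returns the full text consumed by each successive match (hypothetical helper).
    public static string[] ConsumedSpans(string source)
    {
        return Regex.Matches(source, Pattern, RegexOptions.IgnoreCase)
                    .Cast<Match>()
                    .Select(m => m.Value)
                    .ToArray();
    }

    public static void Main()
    {
        string original = "</div>#My Novel<br />##Chapter1<br />" +
                          "It was a dark and stormy night<br />##Chapter 2<br />The End";
        // The question's update: one extra break tag after "Novel".
        string modified = original.Replace("Novel<br />", "Novel<br /><br />");

        Console.WriteLine("Original input:");
        foreach (string span in ConsumedSpans(original)) Console.WriteLine("  " + span);

        Console.WriteLine("Modified input:");
        foreach (string span in ConsumedSpans(modified)) Console.WriteLine("  " + span);
    }
}
```

On the original input the first match consumes `</div>#My Novel<br />`, so the only later position where a preamble is followed by hash marks is the `<br />` before ##Chapter 2. With the extra break tag from the question's update, the second `<br />` is still unconsumed and can act as the preamble for ##Chapter1, which is why that input appears to work.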

To match ##Chapter1-like constructs anyway, we can use a lookbehind assertion. A lookbehind requires the prefix to be present, even if it lies before the current search position, but does not consume it. Because the prefix is no longer part of the match, it also no longer needs to be re-emitted in the replacement string:

  • Replace (?<preamble> with (?<=

  • Then in the replacement string, remove the ${preamble} portion.

The overall search expression now looks like:

(?<=             # removed the preamble capture and replaced with a lookbehind
    (                             
        ([<]\/\w+\d*[>])|([<]\w+\d*\s*\/[>])   #</tag> or <tag />
    )
    \s*  #optional whitespace                               
)

(?<hashmarks>
    \#{1,6}      #1-6 hash marks
)    

(?<content>
    .+?          #header content
 )      

(?<closing>
    ([<](br|\/\s*br|br\s*\/)[>])   #<br>,</br>, or <br />
)

And the replacement string looks like:

<h1>${content}</h1>${closing}

The output now matches the desired output:

</div><h1>My Novel</h1><br />
<h1>Chapter1</h1><br />
It was a dark and stormy night<br />
<h1>Chapter 2</h1><br />
The End
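
For anyone who wants to try it, here is a minimal sketch of the lookbehind version wired into the Regex.Replace overload from the question. The class and method names (LookbehindFix, Convert) are hypothetical, and the input is the sample text joined onto one line, which is an assumption about the real data.

```csharp
using System;
using System.Text.RegularExpressions;

public static class LookbehindFix
{
    // Same expression as above, collapsed onto one line; the preamble is now
    // a lookbehind, so it is checked but not consumed by the match.
    public const string Pattern =
        @"(?<=(([<]\/\w+\d*[>])|([<]\w+\d*\s*\/[>]))\s*)" +
        @"(?<hashmarks>\#{1,6})" +
        @"(?<content>.+?)" +
        @"(?<closing>([<](br|\/\s*br|br\s*\/)[>]))";

    // The preamble is not part of the match, so the replacement does not re-emit it.
    public const string Replacement = "<h1>${content}</h1>${closing}";

    public static string Convert(string source)
    {
        return Regex.Replace(source, Pattern, Replacement, RegexOptions.IgnoreCase);
    }

    public static void Main()
    {
        string source = "</div>#My Novel<br />##Chapter1<br />" +
                        "It was a dark and stormy night<br />##Chapter 2<br />The End";
        Console.WriteLine(Convert(source));
    }
}
```

This relies on .NET accepting variable-length patterns inside a lookbehind (the `\w+` and `\s*` here); many other regex engines reject such patterns, so the expression may not port unchanged.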
Oren Trutner
You are the man! The lookbehind assertion worked like a charm.
JohnFx
You also ought to be able to replace `(?<closing>` with a look*ahead* assertion: `(?=`
Ben Blank
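
A minimal sketch of the lookahead variant suggested in the last comment, assuming it is combined with the lookbehind from the answer. The names (LookaroundFix, Convert) are hypothetical. Since the closing tag is no longer a named group or part of the match, `${closing}` has to be dropped from the replacement string as well.

```csharp
using System;
using System.Text.RegularExpressions;

public static class LookaroundFix
{
    // Both the preamble and the closing tag are lookarounds, so the match
    // itself covers only the hash marks and the header content.
    public const string Pattern =
        @"(?<=(([<]\/\w+\d*[>])|([<]\w+\d*\s*\/[>]))\s*)" +
        @"(?<hashmarks>\#{1,6})" +
        @"(?<content>.+?)" +
        @"(?=([<](br|\/\s*br|br\s*\/)[>]))";

    // Neither surrounding tag is consumed, so neither is re-emitted here.
    public const string Replacement = "<h1>${content}</h1>";

    public static string Convert(string source)
    {
        return Regex.Replace(source, Pattern, Replacement, RegexOptions.IgnoreCase);
    }

    public static void Main()
    {
        string source = "</div>#My Novel<br />##Chapter1<br />" +
                        "It was a dark and stormy night<br />##Chapter 2<br />The End";
        Console.WriteLine(Convert(source));
    }
}
```

On the sample input this should produce the same result as the lookbehind-only version; the difference is only in which text counts as part of each match.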