Word document parsing after HTML conversion

I used examples from threads here to open Word documents and convert them to HTML so that I can parse them. I got it all working with the Office interop library, but I had only tested with a sample Word document that contained some plain text. The real Word documents I need to parse come in all kinds of irregular formatting. They still convert to HTML without problems, but the resulting HTML looks very hard to parse with regular expressions.

For example, in the snippet below I am trying to extract the bold words. My regular expression class=.(?.?). is what I was initially trying. Any ideas would be really appreciated. I tried posting the code but it did not display properly, so here is the raw HTML:

<p class=MsoNormal style='margin-top:5.0pt;margin-right:0in;margin-bottom:5.0pt;
margin-left:.2in;text-align:justify;text-indent:-.15in;line-height:12.0pt;
mso-list:l4 level1 lfo12;tab-stops:list 0in'><![if !supportLists]><span
style='font-size:10.0pt;font-family:Symbol;mso-fareast-font-family:Symbol;
mso-bidi-font-family:Symbol'><span style='mso-list:Ignore'>·<span
style='font:7.0pt "Times New Roman"'>&nbsp;&nbsp;&nbsp; </span></span></span><![endif]><b
style='mso-bidi-font-weight:normal'><span style='font-size:10.0pt;font-family:
"Verdana","sans-serif";color:black'>Databases</span></b><span style='font-size:
10.0pt;font-family:"Verdana","sans-serif";color:black'>: **MS SQL Server 2005, MS
SQL Server 2000**<o:p></o:p></span>

So words like MS SQL Server are the kind of text I would like to extract: basically, English words that are not part of the formatting markup.
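One way to avoid matching the markup at all is to run the HTML through a parser and keep only the text nodes, instead of using regular expressions. The following is a minimal sketch in Python using the standard library's `html.parser`, not the Office interop stack mentioned above. The names `BoldTextExtractor`, `extract_text`, and `bold_text` are hypothetical, and the sketch assumes bold text is marked with `<b>` or `<strong>` tags rather than with `font-weight` styles.

```python
from html.parser import HTMLParser


class BoldTextExtractor(HTMLParser):
    """Collects visible text and records whether it sat inside <b>/<strong>."""

    def __init__(self):
        super().__init__(convert_charrefs=True)
        self.bold_depth = 0   # how many bold tags are currently open
        self.chunks = []      # list of (text, is_bold) tuples

    def handle_starttag(self, tag, attrs):
        # Attributes (class=..., style=...) are ignored, so formatting
        # text such as 'mso-list' never reaches the output.
        if tag in ("b", "strong"):
            self.bold_depth += 1

    def handle_endtag(self, tag):
        if tag in ("b", "strong") and self.bold_depth > 0:
            self.bold_depth -= 1

    def handle_data(self, data):
        # Collapse line breaks and non-breaking spaces left by Word.
        text = " ".join(data.split())
        if text:
            self.chunks.append((text, self.bold_depth > 0))


def extract_text(html):
    """Return every visible text chunk as (text, is_bold)."""
    parser = BoldTextExtractor()
    parser.feed(html)
    parser.close()
    return parser.chunks


def bold_text(html):
    """Return only the chunks that were inside a bold tag."""
    return [text for text, is_bold in extract_text(html) if is_bold]
```

Calling `extract_text` on a paragraph like the one above should return only the readable words, each flagged as bold or not, so a filter on the flag (as in `bold_text`) replaces the regular expression. Bold that Word expresses only through a style attribute would need an extra check on `attrs`, which this sketch does not attempt.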
