I have an HTML string that has the following form:

<tr valign="top"><td colspan="2"  style="padding-bottom:5px;text-align: left"><label for="base_1001013" style="margin-bottom: 3px; float: left">Nom d'utilisateur:&nbsp;</label><span style="float: right;"><input class="PersonalDetailsClass" type="text" name="base_1001013" id="base_1001013" value="" /></span></td></tr>  

(sorry for the formatting..)

Anyhow, I need to get the text that is not inside a tag, i.e. Nom d'utilisateur (without the "&nbsp;", but that's negligible).

The number of tags and their values may vary, and so may the number of words in the requested string and even its language.

I'm not sure whether this is a regex question, an XML question, or a C# string-manipulation question (I have no specific preference), but I would prefer not to use a third-party DLL (which I've seen is sometimes used to parse HTML in C#).

How do I get the value?

Thanks.

+2  A: 

You should use the HtmlAgilityPack and then get the text value of the row. That will eliminate all of the HTML elements in the snippet.

var doc = new HtmlDocument();
doc.LoadHtml( stringWithHtml );
var element = doc.DocumentNode.ChildNodes["tr"];
var text = element.InnerText;

Note that you may need to play around with the navigation to the desired element depending on your actual HTML.
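As one possible way to do that navigation, here is a minimal sketch that selects the first label element with XPath instead of taking the whole row. The class and method names (LabelText, FirstLabelText) are made up for illustration, and it assumes the text you want always sits in a label; adjust the XPath to your real markup.

```csharp
using System;
using HtmlAgilityPack; // third-party package, not part of the .NET framework

static class LabelText
{
    // Hypothetical helper: returns the text of the first <label>, or null if there is none.
    public static string FirstLabelText(string html)
    {
        var doc = new HtmlDocument();
        doc.LoadHtml(html);

        // "//label" matches the first <label> anywhere in the parsed fragment.
        HtmlNode label = doc.DocumentNode.SelectSingleNode("//label");
        if (label == null)
        {
            return null;
        }

        // InnerText leaves entities such as &nbsp; encoded, so decode them first,
        // then turn the non-breaking space into a normal one and trim it away.
        string text = HtmlEntity.DeEntitize(label.InnerText);
        return text.Replace('\u00A0', ' ').Trim();
    }
}

static class Program
{
    static void Main()
    {
        string html = "<tr><td><label for=\"base_1001013\">Nom d'utilisateur:&nbsp;</label>"
                    + "<span><input type=\"text\" name=\"base_1001013\" value=\"\" /></span></td></tr>";
        Console.WriteLine(LabelText.FirstLabelText(html));
    }
}
```

Selecting the label directly means extra text elsewhere in the row does not end up in the result, at the cost of depending on the label being present.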

tvanfosson
A: 
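// requires: using System.Text.RegularExpressions;
Regex reg = new Regex(@"<label\s+for=""base_\d+""\s+style=""[^""]*"">(.+?)</label>");
Match m = reg.Match("<tr ..>...</tr>"); // your text
string t = m.Groups[1].Value; // group 1 is the captured label text; group 0 is the whole match
Console.WriteLine(t);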
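If you would rather not depend on the exact attributes of the label, a rougher variant is to remove everything between "<" and ">" and decode the entities using only framework classes. This is a sketch under the assumption that the input contains nothing but plain tags; the names TagStripper and StripTags are hypothetical. It will give wrong results for comments, script blocks, or a ">" inside an attribute value.

```csharp
using System;
using System.Net;                      // WebUtility.HtmlDecode
using System.Text.RegularExpressions;  // Regex

static class TagStripper
{
    // Hypothetical helper: drops anything that looks like a tag, then decodes entities.
    public static string StripTags(string html)
    {
        // Remove every "<...>" run; this is not a real HTML parser.
        string noTags = Regex.Replace(html, "<[^>]*>", string.Empty);

        // "&nbsp;" decodes to U+00A0 (non-breaking space).
        string decoded = WebUtility.HtmlDecode(noTags);

        return decoded.Replace('\u00A0', ' ').Trim();
    }
}

static class Program
{
    static void Main()
    {
        string html = "<tr valign=\"top\"><td colspan=\"2\"><label for=\"base_1001013\">"
                    + "Nom d'utilisateur:&nbsp;</label><span><input type=\"text\" "
                    + "name=\"base_1001013\" value=\"\" /></span></td></tr>";
        Console.WriteLine(TagStripper.StripTags(html)); // prints "Nom d'utilisateur:"
    }
}
```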
signetro
For anything other than the simplest HTML, you probably shouldn't be using regular expressions: http://stackoverflow.com/questions/1732348/regex-match-open-tags-except-xhtml-self-contained-tags
tvanfosson
@tvanfosson - Will it not work even if I know the input contains only tags that I need to eliminate? As far as I understand my own question, if this eliminates everything between the "<" and ">", that should get it done.
Oren A
@Oren - the point is that regexes are bad at parsing HTML because HTML is a context-free grammar, not a regular grammar (http://en.wikipedia.org/wiki/Formal_grammar). Also, even small changes to your HTML may break your code and force you to update your regular expression. When parsing HTML, it's almost always the best route to use an HTML parser.
tvanfosson
As an example, this won't work if you have something like `<label style="margin-bottom: 3px; float: left" for="base_1001013">Nom d'utilisateur: </label>` (reordered attributes) or `<label style="margin-bottom: 3px; float: left" for="base_1001013"><!-- old data</label> -->Nom d'utilisateur: </label>` (embedded comment); it may or may not do what you want if you have `<label ...>Nom <b>d'utilisateur</b>: </label>` (though that depends on your requirements); it won't handle `CDATA` blocks, extra attributes, etc. Parsing HTML is a solved problem. Regexen are not the solution.
Antal S-Z
Well, when it comes to parsing text/HTML/XML (especially if you are not the owner of the text) and extracting some text from it, at some point your algorithm will fail. It doesn't matter whether you use the DOM, Regex, or XmlDocument. If the text changes in the future to something like `<tr ..><moretags>some text <label...>Nom d'utilisateur</label>...</tr>`, then HtmlAgilityPack will give you: some text Nom d'utilisateur. You just can't make code bulletproof when it comes to data mining.
signetro