ansaurus

Question

Get the HTML value which is not in a tag in c#

Answer 1

+2 A:

You should use the HtmlAgilityPack and then get the text value of the row. That will eliminate all of the HTML elements in the snippet.

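using HtmlAgilityPack;

var doc = new HtmlDocument();
doc.LoadHtml( stringWithHtml );
// Indexing ChildNodes by name only finds a direct child of the document node.
var element = doc.DocumentNode.ChildNodes["tr"];
var text = element.InnerText; // text of the row with all tags removed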

Note that you may need to play around with the navigation to the desired element depending on your actual HTML.

tvanfosson 2010-10-24 14:30:03
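
As a rough illustration of the "navigation" mentioned in Answer 1, the lookup can be done with an XPath query instead of walking ChildNodes. This is only a sketch: the `label` element and its `for="base_..."` attribute are taken from the other answer and the comments in this thread, and `LabelTextDemo`, `ExtractLabelText` and `forPrefix` are made-up names, so adjust the query to your actual HTML.

```csharp
using System;
using HtmlAgilityPack;

public static class LabelTextDemo
{
    // Returns the text of the first <label> whose `for` attribute starts with
    // forPrefix, or null when no such label exists. (Hypothetical helper.)
    public static string ExtractLabelText(string html, string forPrefix)
    {
        var doc = new HtmlDocument();
        doc.LoadHtml(html);

        // "//" searches the whole tree, so the nesting depth of the label does not matter.
        HtmlNode label = doc.DocumentNode.SelectSingleNode(
            "//label[starts-with(@for, '" + forPrefix + "')]");
        if (label == null)
        {
            return null;
        }

        // InnerText drops nested tags; DeEntitize decodes entities such as &amp;.
        return HtmlEntity.DeEntitize(label.InnerText).Trim();
    }
}
```

Calling `LabelTextDemo.ExtractLabelText(stringWithHtml, "base_")` would then return the label text regardless of attribute order or nested tags such as `<b>`.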

Answer 2

A:

using System;
using System.Text.RegularExpressions;

Regex reg = new Regex(@"<label\s*?for=""base_\d+?""\s+?style="".*?"">(.+?)</label>");
Match m = reg.Match("<tr ..>...</tr>"); // your text
string t = m.Groups[1].Value; // group 1 is the text between the label tags
Console.WriteLine(t);

signetro 2010-10-24 14:38:10

For anything other than the simplest HTML, you probably shouldn't be using regular expressions: http://stackoverflow.com/questions/1732348/regex-match-open-tags-except-xhtml-self-contained-tags

tvanfosson 2010-10-24 14:41:18

@tvanfosson - Will it not work even if I know the only tags present are ones I need to eliminate? As far as I understand my own question, if this eliminates everything between "<" and ">", it should get the job done.

Oren A 2010-10-24 14:48:39

@Oren - the point is that regexes are bad at parsing HTML because HTML is a context-free grammar, not a regular grammar (http://en.wikipedia.org/wiki/Formal_grammar). Also, even small changes to your HTML may break your code and force you to update your regular expression. When parsing HTML, it's almost always the best route to use an HTML parser.

tvanfosson 2010-10-24 14:54:23

As an example, this won't work if you have something like `<label style="margin-bottom: 3px; float: left" for="base_1001013">Nom d'utilisateur: </label>` (reordered attributes) or `<label style="margin-bottom: 3px; float: left" for="base_1001013">Nom d'utilisateur: </label>` (embedded comment); it may or may not do what you want if you have `<label ...>Nom <b>d'utilisateur</b>: </label>` (though that depends on your requirements); it won't handle `CDATA` blocks, extra attributes, etc. Parsing HTML is a solved problem. Regexen are not the solution.

Antal S-Z 2010-10-24 18:21:10
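
The reordered-attributes case above is easy to reproduce. The following sketch runs the pattern from Answer 2 (with the class name corrected to `Regex`) against two inputs that differ only in attribute order; `LabelRegexDemo` and `ExtractLabel` are names made up for this illustration, and the first input string is an assumed "expected" layout, not the asker's actual HTML.

```csharp
using System;
using System.Text.RegularExpressions;

public static class LabelRegexDemo
{
    // Pattern from Answer 2: it expects `for` first, then `style`, then the text.
    private static readonly Regex LabelPattern = new Regex(
        @"<label\s*?for=""base_\d+?""\s+?style="".*?"">(.+?)</label>");

    // Returns the captured label text, or null when the pattern does not match.
    public static string ExtractLabel(string html)
    {
        Match m = LabelPattern.Match(html);
        return m.Success ? m.Groups[1].Value : null;
    }

    public static void Main()
    {
        // Attribute order the pattern was written for (assumed layout).
        string expected = @"<label for=""base_1001013"" style=""margin-bottom: 3px; float: left"">Nom d'utilisateur: </label>";
        // Same element with the two attributes swapped.
        string reordered = @"<label style=""margin-bottom: 3px; float: left"" for=""base_1001013"">Nom d'utilisateur: </label>";

        Console.WriteLine(ExtractLabel(expected) ?? "(no match)");
        Console.WriteLine(ExtractLabel(reordered) ?? "(no match)");
    }
}
```

The first call yields the label text and the second finds no match, even though a browser treats both elements identically.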

Well, when it comes to parsing some text/HTML/XML (especially if you are not the owner of the text) and extracting some text from it, at some point your algorithm will fail. It doesn't matter if you use DOM, Regexp or XmlDocument.

signetro 2010-10-24 23:01:27

Well, when it comes to parsing some text/HTML/XML (especially if you are not the owner of the text) and extracting some text from it, at some point your algorithm will fail. It doesn't matter if you use DOM, Regexp or XmlDocument. If the text changes in the future to something like `<tr ..><moretags>some text <label...>Nom d'utilisateur</label>...</tr>`, then HtmlAgilityPack will give us: some text Nom d'utilisateur. You just can't make code bulletproof when it comes to data mining.

signetro 2010-10-24 23:08:15
