Hello, I need to parse an HTML file and extract the NeedThis* strings with C#/.NET. Sample markup:

<tr class="class">
    <td style="width: 120px">
        <a href="NeedThis1">NeedThis2</a>
    </td>
    <td style="width: 120px">
        <a href="NeedThis3">
            NeedThis4</a>
    </td>
    <td style="width: 30%">
        NeedThis5
    </td>
    <td>
        NeedThis6
    </td>
    <td style="width: 120px">
        NeedThis7
    </td>
</tr>

I know an HTML parser would be better here, but all I need is to extract these strings; this is just for a temporary helper tool...

Can anyone help me with this?

Thanks!

A: 

You seem to have answered your own question: you should use a parser. But if you don't, you can use the regex NeedThis.*

Of course, if you want any context with those strings, you should just use a parser.

JoshD
Actually, NeedThis can be any arbitrary string...
Hans W
In that case, **USE A PARSER**
JoshD
@JoshD **NO!!**
Hans W
@Hans W Glad to see you proving that programmers are still as naturally resistant to good ideas as ever.
jball
+2  A: 

If you are sure that your HTML is valid (well-formed XML), you could use LINQ to XML; otherwise you are better off using a parser like HTML Agility Pack.
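
As a rough illustration of the LINQ to XML route, here is a minimal sketch. It assumes the row is well-formed XML (XElement.Parse throws an XmlException on things like an unclosed <br> or an undeclared entity such as &nbsp;). The class and method names (RowExtractor, Extract) are made up for this example, and the sample row is a shortened version of the one in the question.

```csharp
using System;
using System.Collections.Generic;
using System.Xml.Linq;

public static class RowExtractor
{
    // Hypothetical helper: returns each href value and each cell's text,
    // in document order. Assumes the input is a single well-formed <tr>.
    public static List<string> Extract(string wellFormedRow)
    {
        XElement row = XElement.Parse(wellFormedRow);
        var results = new List<string>();

        foreach (XElement cell in row.Elements("td"))
        {
            XElement link = cell.Element("a");
            if (link != null)
            {
                // The explicit string cast yields null when the attribute is absent.
                string href = (string)link.Attribute("href");
                if (href != null) results.Add(href);
            }

            // Value concatenates all descendant text, so it covers plain cells
            // as well as cells whose text sits inside an <a>.
            string text = cell.Value.Trim();
            if (text.Length > 0) results.Add(text);
        }
        return results;
    }

    public static void Main()
    {
        string row = @"<tr class=""class"">
    <td style=""width: 120px""><a href=""NeedThis1"">NeedThis2</a></td>
    <td style=""width: 120px""><a href=""NeedThis3"">
        NeedThis4</a></td>
    <td style=""width: 30%"">
        NeedThis5
    </td>
</tr>";
        foreach (string s in Extract(row)) Console.WriteLine(s);
    }
}
```

For the shortened row above this should print NeedThis1 through NeedThis5, one per line. On real-world pages that are not strict XML, this approach fails at the Parse call rather than returning partial data.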

Vinay B R
+2  A: 

It doesn't matter whether you're doing this for a one-off or for a "finished project". Your task isn't text extraction and it's not something that a regex can do effectively. The data you're looking for depends on the structure of the HTML. Your task is parsing HTML. When your task is parsing HTML, use an HTML parser. It's not difficult. In fact it's a lot easier than writing the pile of regexes you would need otherwise.
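
To give a sense of how little code a parser needs, here is a hedged sketch using HTML Agility Pack (a third-party NuGet package, not part of .NET itself). The class and method names (HapRowExtractor, Extract) are invented for illustration, and the XPath assumes the rows carry class="class" as in the question's sample.

```csharp
using System;
using System.Collections.Generic;
using HtmlAgilityPack; // third-party NuGet package "HtmlAgilityPack"

public static class HapRowExtractor
{
    // Hypothetical helper: collects href values and cell text from matching rows.
    public static List<string> Extract(string html)
    {
        var doc = new HtmlDocument();
        doc.LoadHtml(html); // tolerant of markup that is not well-formed XML
        var results = new List<string>();

        // SelectNodes returns null (not an empty collection) when nothing matches.
        HtmlNodeCollection cells =
            doc.DocumentNode.SelectNodes("//tr[@class='class']/td");
        if (cells == null) return results;

        foreach (HtmlNode cell in cells)
        {
            HtmlNode link = cell.SelectSingleNode("./a[@href]");
            if (link != null) results.Add(link.GetAttributeValue("href", ""));

            // InnerText includes text nested inside the <a>; decode entities, then trim.
            string text = HtmlEntity.DeEntitize(cell.InnerText).Trim();
            if (text.Length > 0) results.Add(text);
        }
        return results;
    }

    public static void Main()
    {
        string html = @"<table><tr class=""class"">
    <td style=""width: 120px""><a href=""NeedThis1"">NeedThis2</a></td>
    <td style=""width: 30%"">
        NeedThis5
    </td>
</tr></table>";
        foreach (string s in Extract(html)) Console.WriteLine(s);
    }
}
```

Unlike a pile of regexes, the XPath expression states the structural dependency directly: cells inside rows of a given class.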

hobbs
A: 

Hans, as you can see from the other answers, using a RegEx is probably not the best way to do what you want to do, but since I need to practice my RegEx anyway, I went ahead and made one in case you want to experiment. This will only catch NeedThis2, but it should give you an idea of how you would write your own RegEx when it is an appropriate solution.

<a href="NeedThis1">NeedThis2</a>

RegEx to catch NeedThis2:

(?:<a\s[^>]*>)(\S*)(?:</a>)

Pretty nasty huh? :)
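
If you do want to experiment from C#, a sketch along these lines may help. It uses a different, slightly more forgiving pattern than the one above: named groups for the href value and the link text, with whitespace around the text allowed. The names (AnchorRegexDemo, Extract) are made up for the example, and the pattern assumes href is the first and only attribute and is double-quoted, as in the sample.

```csharp
using System;
using System.Collections.Generic;
using System.Text.RegularExpressions;

public static class AnchorRegexDemo
{
    // Group "href" captures the attribute value, group "text" the link text.
    // Assumes <a href="..."> with no other attributes and no nested tags.
    private static readonly Regex Anchor = new Regex(
        @"<a\s+href=""(?<href>[^""]*)""\s*>\s*(?<text>[^<]*?)\s*</a>",
        RegexOptions.IgnoreCase);

    // Hypothetical helper: returns href and text for every matching anchor.
    public static List<string> Extract(string html)
    {
        var results = new List<string>();
        foreach (Match m in Anchor.Matches(html))
        {
            results.Add(m.Groups["href"].Value);
            results.Add(m.Groups["text"].Value);
        }
        return results;
    }

    public static void Main()
    {
        string html = @"<a href=""NeedThis1"">NeedThis2</a>
<a href=""NeedThis3"">
    NeedThis4</a>";
        foreach (string s in Extract(html)) Console.WriteLine(s);
    }
}
```

This only covers the anchors; the plain cell text (NeedThis5 to NeedThis7) would need yet another pattern tied to the td structure, which is where the regex approach starts to get unwieldy.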

typoknig