I have a stringstream where it has many strings inside like this:

  <A style="FONT-WEIGHT: bold" id=thread_title_559960       href="http://microsoft.com/forum/f80/topicName-1234/">Beautiful Topic Name</A> </DIV>

I am trying to get only the links that start with:

style="FONT-WEIGHT: bold"

So in the end I will have the link:

http://microsoft.com/forum/f80/topicName-1234/

Topic Id:
    1234

Topic Display Name:
    Beautiful Topic Name

I am using this pattern right now, but it doesn't do everything I need:
    "href=\"(?<url>.*?)\">(?<title>.*?)</A>"

It also matches other links, because they start with href too.

Also, to use Regex, I joined all the lines into a single string. Does regex care about newlines? I.e., can it continue to match strings that span multiple lines?

Please help me with the pattern.

+1  A: 

In regular expressions, the dot wildcard does not match newlines by default. If you want to match any character including newlines, use [^\x00] instead of the dot. This matches everything except the null character, which for ordinary text means it matches everything.

Try this:

<A\s+style="FONT-WEIGHT: bold"\s+id=(\S+)\s+href="([^"]*)">([^\x00]*?)</A>

If you're trying to assign this to a string using double quotes, you'll need to escape the quotes and backslashes. It'll look something like this:

myVar = "<A\\s+style=\"FONT-WEIGHT: bold\"\\s+id=(\\S+)\\s+href=\"([^\"]*)\">([^\\x00]*?)</A>";
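
As a rough sketch of how the pattern might be used from C#, the snippet below runs it against the sample line from the question and reads the three captures by number. It assumes C# 9 or later (top-level statements). The second regex, which pulls the topic id out of the URL, is not part of the answer: it assumes the id is the run of digits between a hyphen and the trailing slash.

```csharp
using System;
using System.Text.RegularExpressions;

// Sample input from the question.
string html = "<A style=\"FONT-WEIGHT: bold\" id=thread_title_559960       " +
              "href=\"http://microsoft.com/forum/f80/topicName-1234/\">Beautiful Topic Name</A> </DIV>";

// The pattern from the answer as a regular (escaped) C# string literal.
string pattern =
    "<A\\s+style=\"FONT-WEIGHT: bold\"\\s+id=(\\S+)\\s+href=\"([^\"]*)\">([^\\x00]*?)</A>";

Match m = Regex.Match(html, pattern);

// Groups[0] is the whole match; the parenthesized captures start at index 1.
string id = m.Groups[1].Value;
string url = m.Groups[2].Value;
string title = m.Groups[3].Value;

// Assumption: the topic id is the digits between a hyphen and the trailing slash.
string topicId = Regex.Match(url, @"-(\d+)/$").Groups[1].Value;

Console.WriteLine(url);      // http://microsoft.com/forum/f80/topicName-1234/
Console.WriteLine(topicId);  // 1234
Console.WriteLine(title);    // Beautiful Topic Name
```

To collect every bold link rather than only the first, Regex.Matches can be used in place of Regex.Match and the results iterated.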
Asaph
Thanks, so it should be: "href=\"(?<url>*?)\">(?<title>*?)</A>"
Joan Venge
@Joan Venge: not quite. I updated my answer with a modification of your regex. Disclaimer: I didn't test it.
Asaph
Or `"href=\"(?<url>[^\"]*)\">(?<title>[^<]*)</A>"` in order to not let "title" match other tags (and thereafter an unrelated </a> tag).
jensgram
Thanks Asaph, can you please help me with the other matches? I can't get my head around matching the fixed style/font string.
Joan Venge
@jensgram: He's already got the non-greedy qualifier after the title so the regex wouldn't swallow up unrelated </a> tags anyway.
Asaph
@Joan Venge: Are the attributes of the <a> tag always going to be in the same order? If you find your regex is becoming increasingly complicated in handling many different cases, you need to rethink your approach. Maybe it's time to look at HTML parsers?
Asaph
Yep, it's gonna be always the same order. It's just basic parsing I need for the links. Nothing major.
Joan Venge
@Joan Venge: For matching the contents of the style attribute you could try `style="([^"]*)"`. This pattern should work for other attributes (if you replace the word "style" with the appropriate attribute name, of course).
Asaph
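
A small sketch of the per-attribute pattern suggested above, assuming C# 9 or later. The attributePattern helper is hypothetical and only wraps the attribute name around the same `="([^"]*)"` shape. It only works for quoted attribute values, so it would not pick up the unquoted id in the sample line.

```csharp
using System;
using System.Text.RegularExpressions;

string html = "<A style=\"FONT-WEIGHT: bold\" id=thread_title_559960 " +
              "href=\"http://microsoft.com/forum/f80/topicName-1234/\">Beautiful Topic Name</A>";

// Hypothetical helper: builds name="([^"]*)" for a given attribute name.
Func<string, string> attributePattern = name => Regex.Escape(name) + "=\"([^\"]*)\"";

// Group 1 holds the text between the quotes.
string style = Regex.Match(html, attributePattern("style")).Groups[1].Value;
string href = Regex.Match(html, attributePattern("href")).Groups[1].Value;

Console.WriteLine(style); // FONT-WEIGHT: bold
Console.WriteLine(href);  // http://microsoft.com/forum/f80/topicName-1234/
```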
Ok, I tried it, but the regex threw an exception. Basically what I need is to match the href links that start with style="FONT-WEIGHT: bold", so in other words, only bold links.
Joan Venge
@Joan Venge: Ok, I updated the regex in my answer to look for only style="FONT-WEIGHT: bold" links. Go ahead and give that a try.
Asaph
Thanks Asaph, but do I need to escape the chars? I used @, but it doesn't work.
Joan Venge
@Joan Venge: If you are plugging this in as a string, you may need to escape the backslashes. Try replacing \ with \\ everywhere in my regex.
Asaph
Just did it, but still getting lots of red error lines in the editor.
Joan Venge
@Joan Venge: Also, try replacing " with \" everywhere in my regex. I think you'll be good after that.
Asaph
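
For reference, here is a sketch of the two ways the same pattern can be written as a C# string literal, assuming C# 9 or later. In a regular literal, backslashes and quotes are escaped with a backslash. In a verbatim (@) literal, backslashes are left alone but every quote has to be doubled, which is a likely reason a plain @ prefix alone produced errors.

```csharp
using System;

// Regular literal: escape backslashes as \\ and quotes as \".
string escaped =
    "<A\\s+style=\"FONT-WEIGHT: bold\"\\s+id=(\\S+)\\s+href=\"([^\"]*)\">([^\\x00]*?)</A>";

// Verbatim literal: backslashes stay as-is, but each quote is doubled.
string verbatim =
    @"<A\s+style=""FONT-WEIGHT: bold""\s+id=(\S+)\s+href=""([^""]*)"">([^\x00]*?)</A>";

// Both literals produce the same pattern text.
Console.WriteLine(escaped == verbatim); // True
```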
Thanks, now it works exactly as you said. The only problem is, I lost the groups. How do I plug them in? As in, link, id, and threadName?
Joan Venge
(?<url>.*?) ? Yours is way different than these?
Joan Venge
@Joan Venge: I included capturing groups for id, href, and the visible link text between the `<A ...>` and `</A>`. Your regex should return 4 groups, in order: the entire string match, the id attribute, the href attribute, and the link text.
Asaph
@Joan Venge: What do you mean by <url>? I assumed that was a placeholder meaning any url. Is that not the case?
Asaph
Thanks, so I access the groups like this: match.Groups["id"]? Because they returned nothing. Yep, url is the link address for the thread topic.
Joan Venge
@Joan Venge: The groups will be numbered, not referenced by strings. The groups should be indexed as follows: match.Groups[0] = the entire string match, match.Groups[1] = the id attribute, match.Groups[2] = the href attribute and match.Groups[3] = the link text.
Asaph
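
If lookups by name such as match.Groups["id"] are preferred, one option is to name the groups in the pattern itself. The sketch below assumes C# 9 or later and reuses the names url and title from the question's original pattern; the name id is added here for illustration.

```csharp
using System;
using System.Text.RegularExpressions;

// Sample input from the question, written as a verbatim string (quotes doubled).
string html = @"<A style=""FONT-WEIGHT: bold"" id=thread_title_559960       href=""http://microsoft.com/forum/f80/topicName-1234/"">Beautiful Topic Name</A> </DIV>";

// Same structure as the answer's pattern, with a name added to each group.
string pattern =
    @"<A\s+style=""FONT-WEIGHT: bold""\s+id=(?<id>\S+)\s+href=""(?<url>[^""]*)"">(?<title>[^\x00]*?)</A>";

Match m = Regex.Match(html, pattern);

// Named groups can be looked up by string.
Console.WriteLine(m.Groups["id"].Value);    // thread_title_559960
Console.WriteLine(m.Groups["url"].Value);   // http://microsoft.com/forum/f80/topicName-1234/
Console.WriteLine(m.Groups["title"].Value); // Beautiful Topic Name
```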
Thank you, everything works now.
Joan Venge
+1  A: 

You can make the . in a pattern match newlines by using the RegexOptions.Singleline option:

Specifies single-line mode. Changes the meaning of the dot (.) so it matches every character (instead of every character except \n).

So if your title spanned multiple lines, with the option enabled the (?<title>.*?) part of the pattern would continue across lines attempting to find a match.
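
A minimal sketch of the difference, assuming C# 9 or later. The input is a made-up variation of the question's link in which the title wraps onto a second line; the pattern is the one from the question.

```csharp
using System;
using System.Text.RegularExpressions;

// Hypothetical input in which the link text wraps onto a second line.
string html = "<A href=\"http://microsoft.com/forum/f80/topicName-1234/\">Beautiful\nTopic Name</A>";

string pattern = "href=\"(?<url>.*?)\">(?<title>.*?)</A>";

// Default options: '.' stops at \n, so the title cannot span the line break.
bool matchesByDefault = Regex.IsMatch(html, pattern);

// Singleline: '.' also matches \n, so the title may cross the line break.
Match m = Regex.Match(html, pattern, RegexOptions.Singleline);

Console.WriteLine(matchesByDefault); // False
Console.WriteLine(m.Success);        // True
```

With the option enabled, the title group here contains both lines, including the newline between them.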

Ahmad Mageed