




I have a schema like this

Question: This is the question

I'd like to get the first part after the <br/>, i.e. always the string before the colon (:).

--> Solution should be "Question"

Note: these words change. Sometimes it is "Question", other times it may be "Big question", and so on.

I tried <h1>(.{0,50}):(.{0,50}), but this returns too much (including the date).

I'm not experienced with regex. Can anyone help me with this?

Thanks a lot.

+2  A: 

Try this:

Yes. First, extract the contents of the `<h1>` element with an HTML parser, then apply this regular expression.
+1  A: 

I believe this will work:

<h1>.*?<br />([^:]+):(.*?)</h1>
Maybe I'm wrong, but why is the `?` needed in `<h1>.*?`? Doesn't `.*` already mean zero or more occurrences of any character?
Yes, but it will also match `<br />` which becomes a problem if there is more than one `<br />` inside the h1 tag, or if there is more than one h1 tag inside the string you're applying the regex to. The ? makes the preceding quantifier lazy, ensuring that the rest matches at the earliest possible location.
Tim Pietzcker
? means non-greedy (i.e. match as little as possible - default is match as much as possible). If not used and the page contains two <h1> tags it will get the <h1> from the first and the </h1> from the last.
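The difference described in these comments can be seen in a small Python sketch. The input string is a made-up example with two `<h1>` elements, not the asker's actual markup.

```python
import re

# Hypothetical input with two <h1> elements (an assumption for illustration).
html = "<h1>A<br />First: one</h1><p>x</p><h1>B<br />Second: two</h1>"

# Greedy: .* runs to the end of the string, then backtracks to the LAST </h1>.
greedy = re.search(r"<h1>(.*)</h1>", html).group(1)

# Lazy: .*? expands one character at a time and stops at the FIRST </h1>.
lazy = re.search(r"<h1>(.*?)</h1>", html).group(1)

print(greedy)  # spans both headings
print(lazy)    # only the first heading's content
```

With the greedy version the captured text swallows the first closing tag and everything up to the second heading's end; the lazy version stays inside the first heading.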
Thanks, all. Greediness is a tricky thing. Googling gave me this: http://www.regular-expressions.info/repeat.html Every day I learn something about regex just by hanging out here on SO!
Thank you so much! You are my personal hero...
But how can I access the first part with ASP.NET and C#? I tried: `string RegexGeschaeftstyp = @"<h1>.*?<br>([^:]+):(.*?)</h1>"; MatchCollection RegexGeschaeftstypMatches = Regex.Matches(strSource, RegexGeschaeftstyp); foreach (Match match1 in RegexGeschaeftstypMatches) { Response.Write("Found " + match1.ToString() + " at position " + match1.Index + ".<br>"); }`
Did a quick Google, something like that should be right. Maybe match1.Value instead of match1.ToString()... not really sure, I've never used Regex in C#.
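The part before the colon is the first capture group, so it has to be read from the group rather than from the whole match. In .NET that is `match1.Groups[1].Value`. The same idea as a minimal Python sketch follows. The sample markup is an assumption, since the question's original HTML is not shown, and `split_heading` is a hypothetical helper name. Note that the C# snippet above searches for `<br>` while the answer's pattern uses `<br />`, so the sketch accepts either form.

```python
import re

# Assumed sample markup; reconstructed from the question's description only.
html = "<h1>07.05.2009<br />Question: This is the question</h1>"

# <br\s*/?> accepts <br>, <br/> and <br /> alike.
pattern = re.compile(r"<h1>.*?<br\s*/?>([^:]+):(.*?)</h1>")

def split_heading(source):
    """Return (label, text) from the first matching <h1>, or None if no match."""
    match = pattern.search(source)
    if match is None:
        return None
    # group(1) is the text before the colon, group(2) the text after it.
    return match.group(1).strip(), match.group(2).strip()

result = split_heading(html)
print(result)
```

Checking for a failed match before reading the groups avoids an exception when the heading is missing or has a different shape.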
+1  A: 

Think about what you mean and translate that into the regex language. As Gumbo has pointed out, you should be using [^:] instead of .; the reason for this is that you are looking for groups of characters that aren't colons ([^:]), not for groups of absolutely any character at all[1] (.) which happen to have colons between them.

Any time you find yourself using . with a quantifier in a regex, stop and ask yourself whether you really mean "any character" or whether you could express your meaning more clearly (and get more accurate results) using a character class instead.

(Non-greedy quantifiers (.*?) can also do the job of getting correct matches in cases like this, but character classes are still a clearer expression of intent for human readers and improve efficiency by avoiding excessive backtracking for machine readers.)

[1] Well, absolutely any character at all, with the possible exception of newlines depending on the regex implementation that you're using.
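A short Python sketch of the point about `.` versus `[^:]`, using a made-up heading text that contains a second colon. The last two lines illustrate the footnote: in Python, `.` skips newlines unless `re.DOTALL` is set.

```python
import re

# Hypothetical text with two colons (an assumption for illustration).
text = "Big question: Why 10:30?"

# Greedy dot: the first group runs up to the LAST colon.
dot = re.match(r"(.*):(.*)", text).groups()

# Negated class: the first group cannot cross a colon, so it stops at the first one.
negated = re.match(r"([^:]+):(.*)", text).groups()

print(dot)
print(negated)

# Footnote [1]: by default "." does not match a newline in Python.
first_line = re.match(r".*", "a\nb").group(0)
both_lines = re.match(r".*", "a\nb", re.DOTALL).group(0)
```

Here the negated class says directly what is wanted ("everything up to the first colon"), whereas the dot version only gets there by accident when the text has a single colon.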

Dave Sherohman

My brain is flooding. Many thanks to everyone who has already helped.

Maybe someone could help once more? This is really important to me :S

<a href="#1">Test 1</a>
<a href="#2">Test 2</a>

This time I'd like to read the second part. Ideally, I would get both parts separately with a single regex.

So: 1. 07.05.2009 2. Test 1

This auxiliary question is now a separate question: http://stackoverflow.com/questions/1564665/regular-expression-exclude-not-needed
Jonathan Leffler
Please don't post questions as an answer.
Jonas Elfström