I have markup like this:

<h1>
5/2009
<br/>
Question: This is the question
</h1>

I would like to get the first part after the <br/>, that is, the string before the colon.

--> The result should be "Question".

Note: the words change. Sometimes it is "Question", other times it may be "Big question", and so on.

I tried <h1>(.{0,50}):(.{0,50}) but this returns too much (it also includes the date).

I'm not experienced with regex; can anyone help me with this?

Thank you a lot.

+2  A: 

Try this:

<br/>([^:]+):
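
A minimal sketch of how this pattern behaves on the sample from the question. Python is used here only for illustration (an assumption, since the asker works in C#); the pattern itself is unchanged. The variable names `html` and `label` are hypothetical.

```python
import re

# Sample markup copied from the question.
html = """<h1>
5/2009
<br/>
Question: This is the question
</h1>"""

# Capture everything between <br/> and the first colon.
match = re.search(r"<br/>([^:]+):", html)

# [^:] also matches newlines, so the capture begins with the line break
# that follows <br/>; strip() removes that surrounding whitespace.
label = match.group(1).strip() if match else None
print(label)
```

Because the sample has a line break directly after `<br/>`, the raw capture includes it, so trimming the result is probably needed in practice.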
Gumbo
Yes. First, extract the contents of the `<h1>` element with an HTML parser, then apply this regular expression.
Svante
+1  A: 

I believe this will work:

<h1>.*?<br />([^:]+):(.*?)</h1>
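
A hedged sketch of applying this pattern to the multi-line sample from the question, again in Python for illustration only. Two details are assumptions on my part rather than part of the answer: the sample writes `<br/>` without a space, so the sketch uses `<br\s*/?>` to tolerate both spellings, and `.` does not match newlines by default, so the sketch enables `re.DOTALL` (the .NET counterpart is `RegexOptions.Singleline`).

```python
import re

# Sample markup from the question, including its line breaks.
html = "<h1>\n5/2009\n<br/>\nQuestion: This is the question\n</h1>"

# The pattern exactly as written does not match this sample:
# "<br />" differs from "<br/>", and "." stops at newlines.
literal = re.search(r"<h1>.*?<br />([^:]+):(.*?)</h1>", html)

# Tolerant variant (an assumption, not the answer's exact pattern).
pattern = re.compile(r"<h1>.*?<br\s*/?>([^:]+):(.*?)</h1>", re.DOTALL)
m = pattern.search(html)

if m:
    first = m.group(1).strip()   # text before the colon
    second = m.group(2).strip()  # text after the colon
    print(first)
    print(second)
```

The two capture groups are read by index: group 1 is the part before the colon and group 2 is the part after it.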
Aistina
Maybe I'm wrong, but why is the `?` needed in `<h1>.*?`? Doesn't `.*` mean zero or more occurrences of any character?
Amarghosh
Yes, but it will also match `<br />` which becomes a problem if there is more than one `<br />` inside the h1 tag, or if there is more than one h1 tag inside the string you're applying the regex to. The ? makes the preceding quantifier lazy, ensuring that the rest matches at the earliest possible location.
Tim Pietzcker
? means non-greedy (i.e. match as little as possible - default is match as much as possible). If not used and the page contains two <h1> tags it will get the <h1> from the first and the </h1> from the last.
rslite
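
The greedy versus lazy behaviour described in the comments above can be checked with a small sketch. The input string is a made-up example with two `<h1>` elements, and Python is used only for illustration.

```python
import re

# Hypothetical input containing two <h1> elements.
html = "<h1>first</h1><p>text</p><h1>second</h1>"

# Greedy: .* runs to the end and backtracks to the LAST </h1>.
greedy = re.findall(r"<h1>(.*)</h1>", html)

# Lazy: .*? stops at the FIRST </h1> after each <h1>.
lazy = re.findall(r"<h1>(.*?)</h1>", html)

print(greedy)
print(lazy)
```

The greedy version yields a single match spanning both elements, while the lazy version yields one match per element.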
Thanks, all. Greediness is a tricky thing. Googling gave me this: http://www.regular-expressions.info/repeat.html Every day I'm learning something about regex just by hanging out here on SO!
Amarghosh
Thank you so much! You are my personal hero...
snarebold
But how can I access the first part with ASP.NET and C#? I tried:

string RegexGeschaeftstyp = @"<h1>.*?<br>([^:]+):(.*?)</h1>";
MatchCollection RegexGeschaeftstypMatches = Regex.Matches(strSource, RegexGeschaeftstyp);
foreach (Match match1 in RegexGeschaeftstypMatches)
{
    Response.Write("Found " + match1.ToString() + " at position " + match1.Index + ".<br>");
}
snarebold
Did a quick Google, something like that should be right. Maybe match1.Value instead of match1.ToString()... not really sure, I've never used Regex in C#.
Aistina
+1  A: 

Think about what you mean and translate that into the regex language. As Gumbo has pointed out, you should be using [^:] instead of .; the reason for this is that you are looking for groups of characters that aren't colons ([^:]), not for groups of absolutely any character at all[1] (.) which happen to have colons between them.

Any time you find yourself using . with a quantifier in a regex, stop and ask yourself whether you really mean "any character" or whether you could express your meaning more clearly (and get more accurate results) using a character class instead.

(Non-greedy quantifiers (.*?) can also do the job of getting correct matches in cases like this, but character classes are still a clearer expression of intent for human readers and improve efficiency by avoiding excessive backtracking for machine readers.)
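
To make the comparison concrete, here is a small sketch that runs the asker's bounded-dot attempt, the character class, and a lazy dot against a single-line version of the sample. The single-line input is an assumption chosen so that `.` can cross the whole string; Python is used only for illustration.

```python
import re

# Single-line variant of the sample from the question (an assumption).
html = "<h1>5/2009<br/>Question: This is the question</h1>"

# The asker's attempt: "." happily consumes the date and the <br/> tag.
dot_bounded = re.search(r"<h1>(.{0,50}):(.{0,50})", html).group(1)

# Character class: states directly that the label contains no colon.
char_class = re.search(r"<br/>([^:]+):", html).group(1)

# Lazy dot: gives the same result here, with the intent less explicit.
lazy_dot = re.search(r"<br/>(.*?):", html).group(1)

print(dot_bounded)
print(char_class)
print(lazy_dot)
```

On this input the character class and the lazy dot agree, while the bounded dot pulls in everything from the date onward.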

[1] Well, absolutely any character at all, with the possible exception of newlines depending on the regex implementation that you're using.

Dave Sherohman
A: 

My brain is overflowing. Many thanks to everyone who has already helped.

Maybe someone can help once more? This is really important to me.

<ul>
<li>
07.05.2009:
<a href="#1">Test 1</a>
</li>
<li>
05.01.2009:
<a href="#2">Test 2</a>
</li>
</ul>

This time I would like to read the second part. Ideally, I would get both parts separately from a single regex.

So: 1. 07.05.2009 2. Test 1

snarebold
This auxiliary question is now a separate question: http://stackoverflow.com/questions/1564665/regular-expression-exclude-not-needed
Jonathan Leffler
Please don't post questions as an answer.
Jonas Elfström