ansaurus

Question

Regular expression - Text between colons

Answer 1

+2 A:

Try this:

<br/>([^:]+):

Gumbo 2009-10-13 09:57:13

Yes. First, extract the contents of the `<h1>` element with an HTML parser, then apply this regular expression.

Svante 2009-10-13 10:02:51
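
A minimal sketch of the two-step approach suggested above: pull out the contents of each `<h1>` with an HTML parser, then apply the regular expression to that smaller string. The original page is not shown in the thread, so the sample markup, the class name `H1Contents`, and the variable names are assumptions made for illustration; Python's standard `html.parser` stands in for whatever parser is actually in use.

```python
import re
from html.parser import HTMLParser


class H1Contents(HTMLParser):
    """Collects the raw inner markup of every <h1> element (hypothetical helper)."""

    def __init__(self):
        super().__init__()
        self.inside = False
        self.contents = []  # one string per <h1> found

    def handle_starttag(self, tag, attrs):
        if tag == "h1":
            self.inside = True
            self.contents.append("")
        elif self.inside:
            # Keep nested tags such as <br> exactly as written in the source.
            self.contents[-1] += self.get_starttag_text()

    def handle_startendtag(self, tag, attrs):
        # Self-closing tags such as <br/> are reported here.
        if self.inside:
            self.contents[-1] += self.get_starttag_text()

    def handle_endtag(self, tag):
        if tag == "h1":
            self.inside = False
        elif self.inside:
            self.contents[-1] += f"</{tag}>"

    def handle_data(self, data):
        if self.inside:
            self.contents[-1] += data


# Assumed sample input.
page = "<div><h1>Shop<br/>Category: Books</h1></div>"

parser = H1Contents()
parser.feed(page)
parser.close()

# Step 2: apply the regex from the answer to each heading's contents.
labels = []
for inner in parser.contents:
    found = re.search(r"<br/>([^:]+):", inner)
    if found:
        labels.append(found.group(1))

print(labels)
```

Restricting the regex to the heading's contents means a stray `<br/>` or colon elsewhere on the page cannot produce a false match.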

Answer 2

+1 A:

I believe this will work:

<h1>.*?<br />([^:]+):(.*?)</h1>

Aistina 2009-10-13 10:00:20

Maybe I'm wrong, but what is the need for the `?` in `<h1>.*?`? Doesn't `.*` mean zero or more occurrences of any character?

Amarghosh 2009-10-13 10:06:06

Yes, but it will also match `<br />` which becomes a problem if there is more than one `<br />` inside the h1 tag, or if there is more than one h1 tag inside the string you're applying the regex to. The ? makes the preceding quantifier lazy, ensuring that the rest matches at the earliest possible location.

Tim Pietzcker 2009-10-13 10:13:00

`?` means non-greedy (i.e. match as little as possible; the default is to match as much as possible). If it is not used and the page contains two `<h1>` tags, the match will run from the first `<h1>` to the last `</h1>`.

rslite 2009-10-13 10:14:06
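
To make the greedy versus lazy difference concrete, here is a small sketch that runs both variants of the pattern over the same string. The sample input with two `<h1>` elements is an assumption, since the original page is not shown in the thread.

```python
import re

# Assumed sample input: two headings, each with a "label: value" part.
html = "<h1>A<br />Type: Shop</h1><p>x</p><h1>B<br />Kind: Office</h1>"

# Lazy quantifiers: each <h1> is matched on its own.
lazy = re.findall(r"<h1>.*?<br />([^:]+):(.*?)</h1>", html)

# Greedy quantifiers: .* runs to the end and backtracks, so a single
# match stretches from the first <h1> to the last </h1>.
greedy = re.findall(r"<h1>.*<br />([^:]+):(.*)</h1>", html)
greedy_span = re.search(r"<h1>.*<br />([^:]+):(.*)</h1>", html).group(0)

print(lazy)    # two (label, value) pairs, one per heading
print(greedy)  # one pair only, taken from the last heading
```

With the greedy pattern the first heading's label and value are swallowed by `.*`, which is the problem described in the comments above.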

Thanks all. Greediness is a tricky thing. Googling gave me this: http://www.regular-expressions.info/repeat.html. Every day I'm learning something about regex just by hanging out here on SO!

Amarghosh 2009-10-13 10:22:30

Thank you so much! You are my personal hero...

snarebold 2009-10-13 11:18:02

But how can I access the first part with ASP.NET and C#? I tried:

string RegexGeschaeftstyp = @"<h1>.*?<br>([^:]+):(.*?)</h1>";
MatchCollection RegexGeschaeftstypMatches = Regex.Matches(strSource, RegexGeschaeftstyp);
foreach (Match match1 in RegexGeschaeftstypMatches)
{
    Response.Write("Found " + match1.ToString() + " at position " + match1.Index + ".<br>");
}

snarebold 2009-10-13 11:46:46

Did a quick Google, something like that should be right. Maybe match1.Value instead of match1.ToString()... not really sure, I've never used Regex in C#.

Aistina 2009-10-13 12:21:57
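
The individual parts are available as capture groups rather than through the whole match. In .NET that would be `match1.Groups[1].Value` and `match1.Groups[2].Value`; `match1.Value` and `match1.ToString()` both return the entire matched text. The sketch below shows the same idea in Python so it can be run as-is; the sample input and variable names are assumptions, not taken from the asker's page.

```python
import re

# Assumed sample input; the real page source is not shown in the thread.
source = "<h1>Title<br>Business type: Retail</h1>"

pattern = re.compile(r"<h1>.*?<br>([^:]+):(.*?)</h1>")

results = []
for match in pattern.finditer(source):
    first = match.group(1)           # text before the colon
    second = match.group(2).strip()  # text after the colon, whitespace trimmed
    results.append((first, second, match.start()))
    print(f"Found {first!r} / {second!r} at position {match.start()}.")
```

Group 0 is always the whole match; groups 1 and 2 correspond to the first and second pair of parentheses in the pattern.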

Answer 3

+1 A:

Think about what you mean and translate that into the regex language. As Gumbo has pointed out, you should be using [^:] instead of .; the reason for this is that you are looking for groups of characters that aren't colons ([^:]), not for groups of absolutely any character at all[1] (.) which happen to have colons between them.

Any time you find yourself using . with a quantifier in a regex, stop and ask yourself whether you really mean "any character" or whether you could express your meaning more clearly (and get more accurate results) using a character class instead.

(Non-greedy quantifiers (.*?) can also do the job of getting correct matches in cases like this, but character classes are still a clearer expression of intent for human readers and improve efficiency by avoiding excessive backtracking for machine readers.)

[1] Well, absolutely any character at all, with the possible exception of newlines depending on the regex implementation that you're using.

Dave Sherohman 2009-10-13 10:05:37
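
A short sketch of the point made in this answer, comparing a negated character class with a greedy and a lazy dot. The sample text, which deliberately contains two colons, is an assumption chosen to expose the difference.

```python
import re

# Assumed sample text with more than one colon.
text = "<br/>Type: Shop, Hours: 9-5"

# [^:]+ cannot cross a colon, so it stops at the first one.
with_class = re.search(r"<br/>([^:]+):", text).group(1)

# Greedy .+ runs to the end of the text and backtracks to the LAST colon.
with_dot = re.search(r"<br/>(.+):", text).group(1)

# Lazy .+? also stops at the first colon, but says so less directly.
with_lazy_dot = re.search(r"<br/>(.+?):", text).group(1)

print(with_class)     # Type
print(with_dot)       # Type: Shop, Hours
print(with_lazy_dot)  # Type
```

The character class states the intent ("anything that is not a colon") in the pattern itself, which is why the answer prefers it over either dot variant.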

Answer 4

A:

My brain is flooding. Many thanks to all who have already helped.

Maybe someone can help again? This is so important for me :S

<ul>
<li>
07.05.2009:
<a href="#1">Test 1</a>
</li>
<li>
05.01.2009:
<a href="#2">Test 2</a>
</li>
</ul>

This time I'd like to read the second part. The best thing would be to get both parts separately with one regex.

So: 1. 07.05.2009 2. Test 1

snarebold 2009-10-14 06:35:46

This auxiliary question is now a separate question: http://stackoverflow.com/questions/1564665/regular-expression-exclude-not-needed

Jonathan Leffler 2009-10-14 07:08:33

Please don't post questions as an answer.

Jonas Elfström 2009-10-14 07:08:47
