I am very weak with Regex and need help. Input looks like the following:

<span> 10/28 &nbsp;&nbsp;Currency:&nbsp;USD

Desired output is 10/28.

I need all the text between the <span> and "Currency:" that consists of digits, "/" characters, or ":" characters, with no spaces.

Can you help? Thanks.

+1  A: 

Here's a good place to start. Using others' code is fine at first, but if you don't learn this stuff you're going to be eternally doomed to asking questions every time you need a new regex.

Mastering Regular Expressions

Regular Expressions Cookbook

Online tutorial

Spend some time, learn the basics, and pretty soon you'll be helping us with our regex problems.

Robert Greiner
I'm in a major crunch; if I had the time to learn it now I wouldn't be asking questions.
Bob
that's cool, read this answer again when you have some free time :P
Robert Greiner
-1 Sorry, but saying "learn regular expressions" isn't answering his specific question.
CAbbott
Reading the first book at the moment, and it's great!
Andomar
@Jeremy, thanks for adding the link.
Robert Greiner
@CAbbott no problem, you aren't wrong about that by any means. I still think it needs to be addressed regardless. Thanks for your comment.
Robert Greiner
@Andomar yeah, I feel a lot more comfortable with regular expressions after reading that book.
Robert Greiner
Thanks for plugging Regular Expressions Cookbook. If anyone doesn't have a copy yet, O'Reilly and I are doing a giveaway at http://www.regexguru.com in which anyone can participate until the end of the month (28 Feb 2010).
Jan Goyvaerts
+3  A: 

Updated: What you're describing has three parts.

What we do want is a run of characters that are digits, forward slashes, or colons: [0-9/:]* (the asterisk means "zero or more instances"). It is surrounded by:

  • <span>(optional stuff we don't want) is represented as: <span>[^0-9/:]*
  • (optional stuff we don't want)Currency is: [^0-9/:]*Currency

(The ^ at the start of a character class means "not") - so this will essentially match any number of characters that are not the bits we want, including things like &nbsp;

In C#:

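// Requires: using System.Text.RegularExpressions;
// The named group "value" captures the run of digits, slashes and colons.
string pattern = @"<span>[^0-9/:]*(?<value>[0-9/:]*)[^0-9/:]*Currency";
Match match = Regex.Match(input, pattern, RegexOptions.Singleline | RegexOptions.IgnoreCase);
string output = match.Groups["value"].Value;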
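
If you want to drop this into a project, here is a minimal sketch of the same idea wrapped in a helper. The names SpanValueExtractor and Extract are made up for illustration. Two things differ from the snippet above and are my assumptions rather than part of the answer: the value uses + instead of * so that an empty value counts as "no match", and the helper returns null when nothing matches.

```csharp
using System;
using System.Text.RegularExpressions;

// Hypothetical helper; the class and method names are illustrative only.
public static class SpanValueExtractor
{
    // <span>, then junk that is not a digit, '/' or ':', then the value,
    // then more junk, then the literal "Currency".
    private static readonly Regex Pattern = new Regex(
        @"<span>[^0-9/:]*(?<value>[0-9/:]+)[^0-9/:]*Currency",
        RegexOptions.Singleline | RegexOptions.IgnoreCase);

    // Returns the captured value, or null if the input does not match (an assumption).
    public static string Extract(string input)
    {
        Match match = Pattern.Match(input);
        return match.Success ? match.Groups["value"].Value : null;
    }
}
```

Calling SpanValueExtractor.Extract on the sample line from the question should give back 10/28. Checking match.Success matters because Groups["value"].Value is just an empty string when the match fails.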
Rex M
Thanks, but this didn't work.
Bob
@Bob sorry, I misread part of your question. I edited to fix it.
Rex M
Still not working, here's an actual sample of the text: <span> 10/28   Currency: USD
Bob
@Jeremy correcting errors is always appreciated, but don't get too edit-happy just removing correct but superfluous text.
Rex M
@Bob you didn't mention you had non-breaking space entities in there! You just said space :D
Rex M
Yeah, I thought `[\s]` over `\s` was setting a bad example, but that was a bit pedantic. Sorry. Also, I was wrong about the colon after Currency not being part of the input. So, wrong on two counts. I'll stop now.
Jeremy Stein
Sorry! Any ideas?
Bob
thanks so much!!
Bob
+1  A: 

Try this regular expression:

<span>(?>.*?([\d/:]+)).*?Currency

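The .*? matches as few characters as possible (a non-greedy, or lazy, quantifier). The (?>...) is an atomic group: once it has matched, the engine will not backtrack into it. It should work for your example <span> 10/28 &nbsp;&nbsp;Currency:&nbsp;USD.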
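
In .NET the value ends up in group 1, because the atomic group itself does not capture. A rough sketch, assuming the whole input is available as one string; AtomicSpanExample and Extract are hypothetical names, and returning null on failure is my own choice:

```csharp
using System;
using System.Text.RegularExpressions;

// Hypothetical wrapper; the class and method names are illustrative only.
public static class AtomicSpanExample
{
    // (?>...) is an atomic (non-capturing) group, so ([\d/:]+) is group 1.
    // Without RegexOptions.Singleline, '.' does not match a newline.
    private static readonly Regex Pattern =
        new Regex(@"<span>(?>.*?([\d/:]+)).*?Currency");

    // Returns the first run of digits, '/' or ':' after <span>, or null if there is no match.
    public static string Extract(string input)
    {
        Match match = Pattern.Match(input);
        return match.Success ? match.Groups[1].Value : null;
    }
}
```

Since . stops at line breaks here, this version expects <span> and Currency to be on the same line.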

This is a nice site to test regular expressions.

Andomar
This regex will suffer from catastrophic backtracking when `<span>` is not followed by `Currency`.
Jeremy Stein
@Jeremy Stein: Tested and New York City wasn't flooded. What do you mean by catastrophic?
Andomar
http://www.regular-expressions.info/catastrophic.html In this case, I guess it's only a problem if there are a lot of digits in the subject. Try it on the source of this page, or on `<span>5555...` where there are about 100 5s. Anyhow, there's an easy fix, which I'll make for you now.
Jeremy Stein
Hmm.. That must be chapter 5.
Andomar