At the moment I am trying to match patterns such as

text text date1 date2

So I have regular expressions that do just that. However, if users enter more than one whitespace character between items, or put some of the text on a new line, the input is not picked up because it doesn't exactly match the pattern.

Is there a more reliable way for pattern matching? The goal is to make it very simple for the user to write but make it easily matchable on my end. I was considering stripping out all the whitespace/newlines etc and then trying to match the pattern with no spaces i.e. texttextdate1date2.

Anyone got any better solutions?

Update

Here is a small example of the pattern I would need to match:

FIND [email protected] 01/01/2010 to 10/01/2010

Here is my current regex:

FIND [A-Z0-9._%+-]+@[A-Z0-9.-]+\.[A-Z]{2,4} [0-9]{1,2}/[0-9]{1,2}/[0-9]{2,4} to [0-9]{1,2}/[0-9]{1,2}/[0-9]{2,4}

This works fine 90% of the time. However, if users submit this information via email, it can contain all kinds of formatting and HTML that I am not interested in. I am using a combination of HtmlAgilityPack and an HTML-tag-removing regex to strip all the HTML from the email, but even then I can't get a match on some occasions.

I believe this could be a more parsing related question than pattern matching, but I think maybe there is a better way of doing this...

A: 

I would split the string into a string array and match each resulting string to the necessary Regular Expression.
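
A minimal sketch of that idea, assuming C#/.NET and the `FIND ... to ...` format from the question's update. The class name `TokenMatcher`, the per-token patterns, the expected token count, and the use of `RegexOptions.IgnoreCase` are illustrative assumptions rather than anything stated in this answer:

```csharp
using System;
using System.Text.RegularExpressions;

public static class TokenMatcher
{
    // Hypothetical per-token patterns, adapted from the regex in the question.
    private static readonly Regex Email = new Regex(
        @"^[A-Z0-9._%+-]+@[A-Z0-9.-]+\.[A-Z]{2,4}$", RegexOptions.IgnoreCase);
    private static readonly Regex Date = new Regex(
        @"^[0-9]{1,2}/[0-9]{1,2}/[0-9]{2,4}$");

    public static bool IsFindCommand(string input)
    {
        // A null separator array makes Split break on any whitespace character;
        // RemoveEmptyEntries drops the empty strings produced by repeated whitespace.
        string[] tokens = input.Split((char[])null, StringSplitOptions.RemoveEmptyEntries);

        // Check each token against the pattern expected at its position.
        return tokens.Length == 5
            && tokens[0] == "FIND"
            && Email.IsMatch(tokens[1])
            && Date.IsMatch(tokens[2])
            && tokens[3] == "to"
            && Date.IsMatch(tokens[4]);
    }
}
```

Because the split discards all whitespace, extra spaces, tabs, and line breaks between the parts no longer matter. The trade-off is that the input must contain nothing but the command itself.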

Steve Danner
Why stop there? `</sarcasm>`
Wim Hollebrandse
+2  A: 

To match one or more whitespace characters (space, tab, newline), use:

\s+

Substitute the above wherever you have the physical space in your pattern and you should be fine.
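
As a rough sketch of that substitution applied to the regex from the question, assuming C#/.NET: the class name `FindCommand`, the added capture groups, and `RegexOptions.IgnoreCase` are assumptions made for illustration and are not part of the original pattern.

```csharp
using System.Text.RegularExpressions;

public static class FindCommand
{
    // The pattern from the question with every literal space replaced by \s+,
    // so runs of spaces, tabs, and line breaks between the parts all match.
    // The parentheses (capture groups) and IgnoreCase are assumptions added here;
    // the original character classes only list A-Z.
    private static readonly Regex Pattern = new Regex(
        @"FIND\s+([A-Z0-9._%+-]+@[A-Z0-9.-]+\.[A-Z]{2,4})\s+" +
        @"([0-9]{1,2}/[0-9]{1,2}/[0-9]{2,4})\s+to\s+" +
        @"([0-9]{1,2}/[0-9]{1,2}/[0-9]{2,4})",
        RegexOptions.IgnoreCase);

    // Returns the Match; check Success before reading Groups[1] to Groups[3]
    // (address, first date, second date).
    public static Match Parse(string input)
    {
        return Pattern.Match(input);
    }
}
```

Since the pattern is not anchored, it can also find the command inside a longer body of text, such as an email with the HTML already stripped.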

Wim Hollebrandse
I think this might be all I am missing actually. I will adapt my regex and see if it works!
James
Is there a way to determine which type of whitespace was found?
James
Thanks, this seemed to do the trick.
James
A: 
\b(text)[\s]+(text)[\s]+(date1)[\s]+(date2)\b
ChaosPandion
A: 

It's a nasty expression, but here is something that will work for the input you provided:

^(\w+)\s+([\w@.]+)\s+(\d{2}\/\d{2}\/\d{4})[^\d]+(\d{2}\/\d{2}\/\d{4})$

This will work with variable amounts of whitespace between the capture groups as well.
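
For illustration, here is one way the expression above might be used from C# to pull out the four captured values. The class name `AnchoredFind` and the method `Extract` are hypothetical; only the pattern itself comes from this answer.

```csharp
using System.Text.RegularExpressions;

public static class AnchoredFind
{
    // The expression from the answer, unchanged.
    private const string Pattern =
        @"^(\w+)\s+([\w@.]+)\s+(\d{2}\/\d{2}\/\d{4})[^\d]+(\d{2}\/\d{2}\/\d{4})$";

    // Returns the four captured values, or null when the input does not match.
    public static string[] Extract(string input)
    {
        Match m = Regex.Match(input, Pattern);
        if (!m.Success)
        {
            return null;
        }
        return new[]
        {
            m.Groups[1].Value, // keyword, e.g. FIND
            m.Groups[2].Value, // address
            m.Groups[3].Value, // first date
            m.Groups[4].Value  // second date
        };
    }
}
```

Two things to keep in mind: the `^` and `$` anchors mean the whole input must be the command, and `\d{2}` requires two-digit days and months, which is stricter than the `{1,2}` used in the question's regex.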

Andrew Hare
Incorrect, I'm afraid: `\w` and `@`, as well as any whitespace character (except `\n`) and the subsequent digits, will be matched by `.`, which is basically greedy matching. Use the `?` suffix for non-greedy matching.
Wim Hollebrandse
The `.` is a member of the character class, so it does not act as a metacharacter there; it is the literal value `"."`.
Andrew Hare
+2  A: 

An example of matching multiple groups in text that contains runs of whitespace and/or newlines:

var txt = "text text   date1\ndate2";
var matches = Regex.Match(txt, @"([a-z]+)\s+([a-z]+)\s+([a-z0-9]+)\s+([a-z0-9]+)", RegexOptions.Singleline);

matches.Groups[n].Value with n from 1 to 4 will contain your matches.

Jonas Elfström