I have a string in C# containing some data I need to extract based on certain conditions.

The string contains many tenders in the following form:

<TENDER> some words, don't know how many, may contain numbers and things like slashes (/) or whatever <DESCRIPTION> some more words and possibly other things like numbers or whatever describing the tender here </DESCRIPTION> some more words and possibly numbers and weird things </TENDER>

This string doesn't contain any nested <TENDER> tags; it's flat. The <DESCRIPTION> tags occur only once within each pair of <TENDER> tags.

I'm using <TENDER>(.+?)</TENDER> as the regex to split up the tenders, and it works fine. If this is wrong or stupid and you know a better way to write it, please let me know, as I have discovered I suck at regex.

My problem is that I now need to select a tender only if its description contains any word from a list of keywords (let's say, for now, I want to select a tender only if it contains either "concrete" or "brick" in the description).

So far the regex I have come up with looks like this, but I don't know what to put in the middle. Also I have a vague suspicion that this might return me some false positives.

<TENDER>(.+?)<DESCRIPTION>have no idea what to do here</DESCRIPTION>(.+?)</TENDER>

If any of you regex gurus could point me in the right direction, I would be most appreciative.

+2  A: 

Use

<TENDER>([^<>]+?)<DESCRIPTION>[^<>]*?(brick|concrete)[^<>]*?</DESCRIPTION>([^<>]+?)</TENDER> 

I am using [^<>] instead of . so that no part of the match can run past a tag boundary. That keeps the keyword check inside the DESCRIPTION tags.
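
For a longer keyword list, the alternation can be built in code rather than typed by hand. The following is only a minimal sketch of that idea: the class name TenderFilter and the method SelectTenders are made up for illustration, and the pattern is a variant of the one above that uses [^<>]* (so the text around the description may be empty) and drops the capture groups.

```csharp
using System;
using System.Collections.Generic;
using System.Linq;
using System.Text.RegularExpressions;

// Hypothetical helper, not part of any library.
public static class TenderFilter
{
    // Returns every <TENDER>...</TENDER> block whose description contains a keyword.
    public static List<string> SelectTenders(string input, IEnumerable<string> keywords)
    {
        // Regex.Escape protects against keywords that contain regex metacharacters.
        string alternation = string.Join("|", keywords.Select(k => Regex.Escape(k)));

        // [^<>]* cannot cross a tag, so the keyword has to sit inside DESCRIPTION.
        string pattern =
            "<TENDER>[^<>]*<DESCRIPTION>[^<>]*(?:" + alternation +
            ")[^<>]*</DESCRIPTION>[^<>]*</TENDER>";

        return Regex.Matches(input, pattern)
                    .Cast<Match>()
                    .Select(m => m.Value)
                    .ToList();
    }
}
```

Calling TenderFilter.SelectTenders(text, new[] { "concrete", "brick" }) should return only the tenders whose description mentions one of the two words. The match is case-sensitive and also accepts the keyword as part of a longer word (such as "bricks"); whether that is wanted depends on the data.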

Jens
I have tried this before. According to Expresso, it returns tenders containing the words brick or concrete even when the words are outside the description tags.
spaceman
Ah, true. That is because `.` matches even the tags. Editing.
Jens
Sorry, now I'm getting no results..
spaceman
What input are you using? It works for me. Your example does not match this regex, because there is no "brick" in the description. I added one, and it matches.
Jens
Hahahaha, sorry, you are correct. I am being a dumbass. Thank you for your help and speedy responses. :D
spaceman
A: 

Instead of regex, try using a proper DOM parsing library, such as the Html Agility Pack. It should work with any tags, even custom ones.

Dan Diplo
+1  A: 

Use RegexOptions.IgnorePatternWhitespace, because I have commented the pattern. The option does not affect how the data is matched; it only lets you break the pattern across lines and add comments.

string pattern = @"
(?<=<TENDER>)                 # Look behind for the opening TENDER tag
(?<TenderBefore>[^<]*)        # Text before the description, into the TenderBefore named capture group
(?:<DESCRIPTION>)
(?=[^<]*(?:brick|concrete))   # Look ahead for a keyword before the next tag
(?<Description>[^<]*)         # Description text, into the Description named capture group
(?:</DESCRIPTION>)
(?<TenderAfter>[^<]*)         # Text after the description, into the TenderAfter named capture group
(?=</TENDER>)                 # Look ahead for the closing TENDER tag";

After running the regex, rebuild each tender from the named groups of its match, for example:

string Tender = string.Format("{0}<DESCRIPTION>{1}</DESCRIPTION>{2}",
 myMatch.Groups["TenderBefore"].Value,
 myMatch.Groups["Description"].Value,
 myMatch.Groups["TenderAfter"].Value);
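
To show how the pieces fit together, here is a rough, self-contained sketch of running a commented pattern and reading a named group from each match. The class name TenderGroups and the method Descriptions are invented for the example, and the pattern is an assumed variant that matches the tags directly and uses [^<]* so that no group can run past a tag.

```csharp
using System;
using System.Collections.Generic;
using System.Text.RegularExpressions;

// Hypothetical helper, not part of any library.
public static class TenderGroups
{
    // Whitespace and # comments are ignored under RegexOptions.IgnorePatternWhitespace.
    private const string Pattern = @"
<TENDER>
(?<TenderBefore>[^<]*)          # text before the description
<DESCRIPTION>
(?=[^<]*(?:brick|concrete))     # a keyword must occur before the next tag
(?<Description>[^<]*)           # the description text
</DESCRIPTION>
(?<TenderAfter>[^<]*)           # text after the description
</TENDER>";

    // Returns the description text of every tender that contains a keyword.
    public static List<string> Descriptions(string input)
    {
        var found = new List<string>();
        foreach (Match m in Regex.Matches(input, Pattern, RegexOptions.IgnorePatternWhitespace))
        {
            found.Add(m.Groups["Description"].Value);
        }
        return found;
    }
}
```

The other two groups can be read the same way through m.Groups["TenderBefore"] and m.Groups["TenderAfter"].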

HTH

OmegaMan