I want to match and modify part of a string if the following conditions are true:

I want to capture information regarding a project, like project duration, client, technologies used, etc.

So I want to select a string that starts with the word "project", or with other words such as "details of project", "project details" or "project #1".

The regex should first look for the word "project", and it should select the string only when some or all of the following words are found after the word "project".

     1) client
     2) duration
     3) environment 
     4) technologies  
     5) role

I want to select a string if it contains at least two of the above words. The words can appear in any order; if the string contains ANY two or three of them, it should be selected.

I have sample text given below.


Details of Projects : *Project #1: CVC – Customer Value Creation (Sep 2007 – till now) Time Warner Cable is the world's leading media and entertainment company, Time Warner Cable (TWC) makes coaxial quiver.

Client : Time Warner Cable,US. ETL

Tool : Informatica 7.1.4

Database : Oracle 9i.

Role : ETL Developer/Team Lead.

O/S : UNIX.

Responsibilities: Created Test Plan and Test Case Book. Peer reviewed team members > Mappings. Documented Mappings. Leading the Development Team. Sending Reports to onsite. Bug >fixing for Defects, Data and Performance related.

Details of Project #2: MYER – Sales Analysis system (Nov 2005 – till now) Coles Myer is one of Australia's largest retailers with more than 2,000 > stores throughout Australia,

Client : Coles Myer Retail, Australia. ETL Tool : Informatica 7.1.3 Database : Oracle 8i. Role : ETL Developer. O/S : UNIX. Responsibilities: Extraction, Transformation and Loading of the data using Informatica. Understanding the entire source system.
Created and Run Sessions and Workflows. Created Sort files using Syncsort Application.*

Does anyone know how to achieve this using regular expressions? Any clues or regular expressions are welcome!

Many thanks!

+1  A: 

I would break it down into a few simpler regexes to get these results. The first would select only the chunk of text between projects: (?=Project #).*(?<=Project #)
With the match that this produces, I would run a separate regex to ask whether it contains any of those words: client|duration|environment|technologies|role. If this comes back with two or more distinct matches, you know to select the original string!

Edit:

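using System.Linq;
using System.Text.RegularExpressions;

string originalText = "..."; // the text to search
// Each match runs from one "Project #" up to (but not including) the next one.
MatchCollection projectDescriptions = Regex.Matches(originalText, "(?=Project #).(?:(?!Project #).)*", RegexOptions.IgnoreCase | RegexOptions.Singleline);
foreach (Match projectDescription in projectDescriptions)
{
  // No spaces around "|": without RegexOptions.IgnorePatternWhitespace they would be matched literally.
  MatchCollection keyWordMatches = Regex.Matches(projectDescription.Value, "client|duration|environment|technologies|role", RegexOptions.IgnoreCase);
  // Count the distinct keywords found, ignoring case.
  int distinctKeyWords = keyWordMatches.Cast<Match>().Select(m => m.Value.ToLowerInvariant()).Distinct().Count();
  if (distinctKeyWords >= 2)
  {
    //At this point, do whatever you need to with the original projectDescription match, the Match object will give you the index etc of the match inside the original string.
  }
}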
xoxo
Thanks xoxo for the help. Can you please describe how to break this operation down into different regexes, or give the name of a book, a link to a tutorial, or some guide that explains this? As I am new to regex, I have little knowledge of how to do it. I am using C# for this. Thanks for the help.
Shekhar
I hope the edit helps?
xoxo
A: 

Try

^(details of )?project.*?((client|duration|environment|technologies|role).*?){2}.*$

One note: This will also match if only one of the terms appears twice.

In C#:

foundMatch = Regex.IsMatch(subjectString, @"\A(?:(details of )?project.*?((client|duration|environment|technologies|role).*?){2}.*)\Z", RegexOptions.Singleline | RegexOptions.IgnoreCase);
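
To make the note above concrete, here is a minimal sketch that runs the same pattern against three short made-up strings. The class name `RepeatedKeywordDemo`, the helper `IsProjectBlock` and the sample strings are hypothetical and only serve as illustration.

```csharp
using System;
using System.Text.RegularExpressions;

class RepeatedKeywordDemo
{
    // Same pattern as in the answer above.
    const string Pattern =
        @"\A(?:(details of )?project.*?((client|duration|environment|technologies|role).*?){2}.*)\Z";

    // Hypothetical helper: true when the whole string matches the pattern.
    public static bool IsProjectBlock(string text) =>
        Regex.IsMatch(text, Pattern, RegexOptions.Singleline | RegexOptions.IgnoreCase);

    static void Main()
    {
        // Two different keywords after "Project": matches, as intended.
        Console.WriteLine(IsProjectBlock("Project #1: CVC\nClient : X.\nRole : Y."));
        // The same keyword twice: also matches, which is the caveat noted above.
        Console.WriteLine(IsProjectBlock("Project #1: the client met the client."));
        // Only one keyword: no match.
        Console.WriteLine(IsProjectBlock("Project #1: CVC\nClient : X."));
    }
}
```

If repeated keywords must not count, the pattern alone is not enough and the distinct keywords have to be checked separately.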
Tim Pietzcker
Sorry. I am using C# for this. The regex you have given fails at only one point: if the string starts with "Details of project", it won't work. Actually, I had created the same regex that you have given. How can I handle this problem? Can we really solve this using regex, or do we need some other approach?
Shekhar
Well, what do you want? You said the string should start with "Project". Your own example string doesn't. What exactly is the condition that the string needs to match? Please edit your question to clarify.
Tim Pietzcker
@Tim, I'm sorry if I have confused you people. In the text that I have, project-related information can start with the word "project" itself, or it may start with "Details of project". I want to select the project-related information (like client, role, environment, technologies used) in both cases.
Shekhar
OK, I edited my answer. Hint: Edit your question, too, or other people will continue to give answers that won't work.
Tim Pietzcker
Thanks. I will edit my question.
Shekhar
OK, now suddenly you have two projects in one string. By now I think that one single regex won't cut it. No fair downvoting my answer though, if specifications change all the time...
Tim Pietzcker
A: 

Maybe you need to break those requirements into two steps: first, take the key/value pairs from your string, then apply your filter.

string input = @"Project #...";
Regex projects = new Regex(@"(?<key>\S+).:.(?<value>.*?\.)");
foreach (Match project in projects.Matches(input))
{
    Console.WriteLine ("{0} : {1}", 
        project.Groups["key"  ].Value, 
        project.Groups["value"].Value);
}
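
The loop above only prints the pairs. The second step (the filter) could look roughly like the sketch below, which counts how many distinct keys belong to the keyword list from the question. The class `KeyValueFilterDemo`, the method `CountKeywordKeys` and the choice of a case-insensitive `HashSet` are assumptions made for illustration, not part of the original answer.

```csharp
using System;
using System.Collections.Generic;
using System.Linq;
using System.Text.RegularExpressions;

class KeyValueFilterDemo
{
    // Keywords from the question, compared case-insensitively.
    static readonly HashSet<string> Keywords =
        new HashSet<string>(StringComparer.OrdinalIgnoreCase)
        { "client", "duration", "environment", "technologies", "role" };

    // Hypothetical helper: number of distinct keys that are keywords.
    public static int CountKeywordKeys(string input)
    {
        // Same key/value pattern as in the answer above.
        var pairs = new Regex(@"(?<key>\S+).:.(?<value>.*?\.)");
        return pairs.Matches(input)
            .Cast<Match>()
            .Select(m => m.Groups["key"].Value)
            .Where(k => Keywords.Contains(k))
            .Distinct(StringComparer.OrdinalIgnoreCase)
            .Count();
    }

    static void Main()
    {
        string input = "Client : Coles Myer Retail, Australia. ETL Tool : Informatica 7.1.3 " +
                       "Database : Oracle 8i. Role : ETL Developer. O/S : UNIX.";
        // Select the text only when at least two distinct keywords appear as keys.
        Console.WriteLine(CountKeywordKeys(input) >= 2);
    }
}
```

In this sample the keys found are Client, Tool, Database, Role and O/S, of which Client and Role are keywords, so the text would be selected.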
Rubens Farias
Shekhar
+2  A: 
(client|duration|environment|technologies|role).+(?!\1)(client|duration|environment|technologies|role)
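
A minimal C# sketch of this idea, assuming the goal is two different keywords: the negative lookahead `(?!\1)` sits immediately before the second group, so the second keyword cannot repeat the first. The class `DistinctPairDemo`, the method `HasTwoDistinctKeywords` and the sample strings are hypothetical. `RegexOptions.Singleline` is added as an assumption so that `.+` can span line breaks.

```csharp
using System;
using System.Text.RegularExpressions;

class DistinctPairDemo
{
    const string Keyword = "(client|duration|environment|technologies|role)";

    // First keyword, anything in between, then a keyword that differs from group 1.
    static readonly Regex TwoDistinct = new Regex(
        Keyword + @".+(?!\1)" + Keyword,
        RegexOptions.IgnoreCase | RegexOptions.Singleline);

    // Hypothetical helper: true when two different keywords occur in the text.
    public static bool HasTwoDistinctKeywords(string text) => TwoDistinct.IsMatch(text);

    static void Main()
    {
        // Two different keywords: match.
        Console.WriteLine(HasTwoDistinctKeywords("Client : X. Role : Y."));
        // The same keyword twice: no match.
        Console.WriteLine(HasTwoDistinctKeywords("Client : X. Client : Z."));
    }
}
```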
theraccoonbear