Imagine we have a long string containing the substrings 'cat' and 'dog' as well as other random characters, e.g.:

cat x dog cat x cat x dog x dog x cat x dog x cat

Here 'x' represents any random sequence of characters (but not 'cat' or 'dog').

What I want to do is find every 'cat' that is followed by another 'cat' with no 'dog' anywhere in between, and remove the first 'cat' of each such pair.

In this case, I would want to remove the bracketed [cat] because there is no 'dog' after it before the next 'cat':

cat x dog [cat] x cat x dog x dog x cat x dog x cat

To end up with:

cat x dog x cat x dog x dog x cat x dog x cat

How can this be done?

I thought of somehow using a regular expression like (n)(?=(n)), as VonC recommended here:

(cat)(?=(.*cat))

to match all of the pairs of 'cat' in the string. But I am still not sure how I could use this to remove each cat that is not followed by 'dog' before 'cat'.


The real problem I am tackling is in Java. But I am really just looking for a general pseudocode/regex solution.

+1  A: 

Is there any particular reason you want to do this with just one RE call? I'm not sure if that's actually possible in one RE.
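
That said, a single pattern may be workable if the regex flavour supports lookahead, which Java's does. Here is a minimal sketch, assuming the markers are the literal strings 'cat' and 'dog'. The class and method names are made up for illustration. The idea is to match a 'cat' only when the lookahead can reach another 'cat' without stepping over the start of a 'dog'.

```java
import java.util.regex.Pattern;

// Hypothetical helper class, not part of any library.
class SingleRegexCatRemover {
    // cat                 the candidate to remove
    // (?= ... cat)        lookahead: another cat must follow (nothing is consumed)
    // (?:(?!dog).)*?      any characters in between, none of which starts a "dog"
    // DOTALL lets '.' cross line breaks, in case the input spans several lines.
    private static final Pattern CAT_BEFORE_CAT =
            Pattern.compile("cat(?=(?:(?!dog).)*?cat)", Pattern.DOTALL);

    static String removeCats(String text) {
        return CAT_BEFORE_CAT.matcher(text).replaceAll("");
    }

    public static void main(String[] args) {
        System.out.println(removeCats("cat x dog cat x cat x dog x dog x cat x dog x cat"));
    }
}
```

Only the 'cat' itself is removed, so on the example string the space on either side of it stays behind and the result has two spaces after the first 'dog'. On very long inputs the lookahead rescans the text after each 'cat', so this may be slower than a hand-written scan.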

If I had to do this, I'd probably go in two passes. First mark each instance of 'cat' and 'dog' in the string, then write some code to identify which cats need to be removed, and do that in another pass.

Pseudocode follows:

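// Pass 1: find the start index of every cat and every dog
int[] catLocations = string.findIndex(/cat/);
int[] dogLocations = string.findIndex(/dog/);

// Pick each cat that is followed by another cat with no dog in between
int[] idsToRemove = doLogic(catLocations, dogLocations);

// Pass 2: remove each identified cat, from the end to the front,
// so that the indices of the earlier cats stay valid
for (int id : idsToRemove.reverse())
  string.removeSubstring(id, "cat".length());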
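
Since the real problem is in Java, the pseudocode above could be fleshed out roughly as follows. This is only a sketch: findAll, catsToRemove and the class name are hypothetical stand-ins for findIndex and doLogic, and it assumes the markers are the literal strings 'cat' and 'dog'.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical helper class illustrating the two-pass idea.
class TwoPassCatRemover {
    // Pass 1: collect the start index of every occurrence of token, in ascending order.
    static List<Integer> findAll(String text, String token) {
        List<Integer> hits = new ArrayList<>();
        for (int i = text.indexOf(token); i >= 0; i = text.indexOf(token, i + 1)) {
            hits.add(i);
        }
        return hits;
    }

    // A cat is marked for removal when no dog starts between it and the next cat.
    // The last cat is never marked, because no cat follows it.
    static List<Integer> catsToRemove(List<Integer> cats, List<Integer> dogs) {
        List<Integer> remove = new ArrayList<>();
        int d = 0; // index of the first dog that has not been passed yet
        for (int c = 0; c + 1 < cats.size(); c++) {
            int here = cats.get(c);
            int next = cats.get(c + 1);
            while (d < dogs.size() && dogs.get(d) < here) {
                d++; // skip dogs that start before the current cat
            }
            boolean dogBetween = d < dogs.size() && dogs.get(d) < next;
            if (!dogBetween) {
                remove.add(here);
            }
        }
        return remove;
    }

    // Pass 2: delete the marked cats from the end to the front so earlier indices stay valid.
    static String removeCats(String text) {
        List<Integer> ids = catsToRemove(findAll(text, "cat"), findAll(text, "dog"));
        StringBuilder sb = new StringBuilder(text);
        for (int i = ids.size() - 1; i >= 0; i--) {
            int start = ids.get(i);
            sb.delete(start, start + "cat".length());
        }
        return sb.toString();
    }

    public static void main(String[] args) {
        System.out.println(removeCats("cat x dog cat x cat x dog x dog x cat x dog x cat"));
    }
}
```

On the example string this removes only the second 'cat'. The spaces around it are left alone, so the result has two spaces after the first 'dog'.
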
zigdon