Looking to find the appropriate regular expression for the following conditions:

I need to clean up certain tags within free-flowing text. For example, the text contains two important kinds of tag: <2004:04:12> and <name of person>. Unfortunately, some of the tags are missing their "<" or ">" delimiter.

For example, some are as follows:

1) <2004:04:12 , I need this to be <2004:04:12>
2) 2004:04:12>, I need this to be <2004:04:12>
3) <John Doe , I need this to be <John Doe>

I attempted to use the following for situation 1:

String regex = "<\\d{4}:\\d{2}:\\d{2}\\w*{2}[^>]";
String output = content.replaceAll(regex,"$0>");

This did find all instances of "<2004:04:12" and the result was "<2004:04:12 >". However, I need to eliminate the space prior to the ending tag.

I'm not sure this is the best way. Any suggestions?

Thanks

A: 

Basically, you are looking for a negative look-ahead, like this:

String regex = "<\\d{4}:\\d{2}:\\d{2}(?!>)";
String output = content.replaceAll(regex,"$0>");

This will help with the numeric "tags". Names are harder: no regex can be intelligent enough to match an arbitrary name, so you must either define very precisely what a name can look like, or accept that the same approach will not work for "name" tags.
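
A minimal, self-contained sketch of the look-ahead fix, assuming the colon-separated date format from the question. The class and method names (DateTagCloser, closeDateTags) are made up for illustration:

```java
import java.util.regex.Pattern;

final class DateTagCloser {
    // "<yyyy:mm:dd" that is NOT already followed by ">".
    // The negative look-ahead (?!>) checks the next character without consuming it.
    private static final Pattern UNCLOSED_DATE =
            Pattern.compile("<\\d{4}:\\d{2}:\\d{2}(?!>)");

    static String closeDateTags(String content) {
        // $0 is the whole match, so ">" is appended directly after the date.
        return UNCLOSED_DATE.matcher(content).replaceAll("$0>");
    }

    public static void main(String[] args) {
        // prints: on <2004:04:12> , we met
        System.out.println(closeDateTags("on <2004:04:12 , we met"));
        // already closed, printed unchanged: on <2004:04:12>, we met
        System.out.println(closeDateTags("on <2004:04:12>, we met"));
    }
}
```

Because the look-ahead consumes nothing, the ">" lands before the space rather than after it. This only covers a missing ">"; a date with a missing "<" is left untouched.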

Tomalak
thanks, this solved my problem.
Brian
A: 

For fixing the dates, you can match any date with zero, one, or two angle brackets:

String regex = "(\\s?\\<?)(\\d{4}:\\d{2}:\\d{2})(\\>?\\s)";
String replace = " <$2> ";
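
Note that the pattern above only matches a date that is followed by a whitespace character, and its replacement pads the tag with spaces. As a rough variant, assuming you would rather leave the surrounding whitespace alone, the same "optional bracket on either side" idea can be sketched like this (DateTagNormalizer and normalizeDateTags are illustrative names, not from the answer):

```java
import java.util.regex.Pattern;

final class DateTagNormalizer {
    // Optional "<", the date itself (group 1), optional ">".
    private static final Pattern DATE =
            Pattern.compile("<?(\\d{4}:\\d{2}:\\d{2})>?");

    static String normalizeDateTags(String content) {
        // Rewrite every date with both brackets, whatever it had before.
        return DATE.matcher(content).replaceAll("<$1>");
    }

    public static void main(String[] args) {
        // prints: a <2004:04:12> , b
        System.out.println(normalizeDateTags("a <2004:04:12 , b"));
        // prints: <2004:04:12>, c
        System.out.println(normalizeDateTags("2004:04:12>, c"));
        // well-formed tag, printed unchanged: <2004:04:12>
        System.out.println(normalizeDateTags("<2004:04:12>"));
    }
}
```

Like the pattern above, this also wraps dates that have no brackets at all, which may or may not be what you want.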

To recognise a name, we assume that each part of the name begins with a capital letter and that the only separator is a space. We match the angle bracket explicitly at the start or the end, and the character just after (or just before) the name must be whitespace or punctuation.

// Opening bracket present, closing bracket missing:
String regex = "(\\<[A-Z][a-zA-Z]*(\\s[A-Z][a-zA-Z]*)*)(?=[\\.!?:;\\s])";
String replace = "$1>";

// Closing bracket present, opening bracket missing:
String regex = "(?<=[\\.!?:;\\s])([A-Z][a-zA-Z]*(\\s[A-Z][a-zA-Z]*)*\\>)";
String replace = "<$1";
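
Here is one way the two name fixes might be combined into a runnable sketch, under the same assumption that a name is a run of capitalised words separated by single whitespace characters. It additionally tries to leave already well-formed tags such as <John Doe> alone: possessive quantifiers stop the first pattern from backtracking into a closed tag, and the second pattern takes an optional "<" so a complete tag is rewritten to itself. The names NameTagFixer and fixNameTags are hypothetical:

```java
import java.util.regex.Pattern;

final class NameTagFixer {
    // "<Name Name" followed by whitespace, punctuation, or end of input.
    // Possessive quantifiers (*+) prevent giving back part of the name,
    // so "<John Doe>" is not split into "<John> Doe>".
    private static final Pattern MISSING_CLOSE = Pattern.compile(
            "(<[A-Z][a-zA-Z]*+(?:\\s[A-Z][a-zA-Z]*+)*+)(?=[.!?:;\\s]|$)");

    // Optional "<", a word boundary, the name (group 1), then a literal ">".
    private static final Pattern CLOSED_NAME = Pattern.compile(
            "<?\\b([A-Z][a-zA-Z]*(?:\\s[A-Z][a-zA-Z]*)*)>");

    static String fixNameTags(String content) {
        // Pass 1: append the missing ">".
        String closed = MISSING_CLOSE.matcher(content).replaceAll("$1>");
        // Pass 2: make sure every "Name>" starts with exactly one "<".
        return CLOSED_NAME.matcher(closed).replaceAll("<$1>");
    }

    public static void main(String[] args) {
        // prints: <John Doe> , I need
        System.out.println(fixNameTags("<John Doe , I need"));
        // prints: met <John Doe> today
        System.out.println(fixNameTags("met John Doe> today"));
        // well-formed tag, printed unchanged: <John Doe>
        System.out.println(fixNameTags("<John Doe>"));
    }
}
```

The heuristic is still fragile: any capitalised word directly next to the name (for example, the first word of a sentence before "John Doe>") is absorbed into the tag, so the result is worth reviewing on real data.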
mdma
This also solved the issue. Cheers.
Brian