Hi all,

I am tasked with white-labeling an application so that it contains no references to our company, website, etc. The problem I am running into is that I have many different patterns to look for, and I would like to guarantee that all of them are removed. Since the application was not developed entirely in-house, we cannot simply look for occurrences in messages.properties and be done. We must go through JSPs, Java code, and XML.

I am using grep to filter results like this:

grep -ir SOME_PATTERN . | grep -v import | grep -v '//' | grep -v '/\*' ...

The patterns are escaped when I'm using them on the command line; however, I don't feel this pattern matching is very robust. There could possibly be occurrences that have import in them (unlikely) or even /* (the beginning of a javadoc comment).

All of the text output to the screen must come from a string declaration somewhere or a constants file. So, I can assume I will find something like:

public static final String SOME_CONSTANT = "SOME_PATTERN is currently unavailable";

I would like to find that occurrence as well as:

public static final String SOME_CONSTANT = "
SOME_PATTERN blah blah blah";

Alternatively, if we had an internal crawler / automated tests, I could simply pull back the xhtml from each page and check the source to ensure it was clean.

Any ideas?

Walter

A: 

I would use sed, not grep! sed performs basic text transformations on an input stream. Try its s/regexp/replacement/ command.

You can also try awk. Its -F option sets the field separator, so you can pass ; to split each line of your files into fields at the semicolons.

The best solution, however, would be a simple script in Perl or Python.
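A minimal sketch of what such a Python script might look like, assuming the goal is to report Java string literals that contain the pattern while skipping comments. The names BRAND, EXTENSIONS, find_in_strings, and scan_tree are made up for illustration. It does not handle Java character literals such as '"', and it is only meant for .java files; JSP and XML text would still need a plain search.

```python
import re
from pathlib import Path

# Hypothetical values: adjust the pattern and extensions to the project.
BRAND = re.compile(r"SOME_PATTERN", re.IGNORECASE)
EXTENSIONS = {".java"}

# One alternation consumes comments and string literals in source order, so a
# "//" inside a string (for example a URL) is not mistaken for a comment start.
TOKEN = re.compile(
    r'''
      (?P<comment>//[^\n]*|/\*.*?\*/)     # line comment or block comment
    | (?P<string>"(?:\\.|[^"\\])*")       # double-quoted literal, may span lines
    ''',
    re.DOTALL | re.VERBOSE,
)


def find_in_strings(text):
    """Return (line_number, literal) for each string literal containing BRAND."""
    hits = []
    for match in TOKEN.finditer(text):
        if match.lastgroup == "string" and BRAND.search(match.group()):
            # Line number of the opening quote, counted from 1.
            line = text.count("\n", 0, match.start()) + 1
            hits.append((line, match.group()))
    return hits


def scan_tree(root):
    """Print every matching literal found under root, one per line."""
    for path in sorted(Path(root).rglob("*")):
        if path.is_file() and path.suffix in EXTENSIONS:
            text = path.read_text(encoding="utf-8", errors="replace")
            for line, literal in find_in_strings(text):
                print(f"{path}:{line}: {literal!r}")


# Example call (hypothetical source root):
# scan_tree("src")
```

Because the literal pattern is allowed to span newlines, a declaration whose string starts on one line and continues on the next is reported at the line of its opening quote.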

psihodelia
sed is what I ended up using. In fact, it is very easy to use, and once I figured out what regular expression I needed, everything fell into place. I simply daisy-chained my commands together:

sed -e s/regexp/replacement/ -e ... -e ... | grep SOME_PATTERN > occurrences
A: 

To address your concern about missing some occurrences, why not filter progressively:

  1. Create a text file with all possible matches as a starting point.
  2. Use filter X (grep for '^import', for example) to dump probable false positives into a tmp file.
  3. Use filter X again to remove those matches from your working file (a copy of [1]).
  4. Do a quick visual pass of the tmp file and add any real matches back in.
  5. Repeat [2]-[4] with other filters.
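
The steps above could be sketched in Python roughly as follows. This is only an illustration under some assumptions: each candidate is a (location, text) pair, each filter is a regular expression applied to the text, and the review callback stands in for the manual visual pass in step 4. The function names are hypothetical.

```python
import re


def split_by_filter(candidates, pattern):
    """Split (location, text) pairs into (kept, set_aside) using one filter."""
    rx = re.compile(pattern)
    kept, set_aside = [], []
    for item in candidates:
        # Items whose text matches the filter are probable false positives.
        (set_aside if rx.search(item[1]) else kept).append(item)
    return kept, set_aside


def progressive_filter(candidates, patterns, review):
    """Apply each filter in turn; review(pattern, set_aside) returns the
    items that turned out to be real matches and should be restored."""
    working = list(candidates)  # step 3 works on a copy of the full list
    for pattern in patterns:
        working, set_aside = split_by_filter(working, pattern)
        working.extend(review(pattern, set_aside))  # step 4
    return working
```

For example, calling progressive_filter(candidates, [r"^\s*import\b"], review) sets aside the import lines, and only what review hands back rejoins the working list.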

This might take some time, of course, but it doesn't sound like this is something you want to get wrong...

grossvogel
Sounds like a possible winner. I was hoping to find a regular expression that was the magic/easy button.
I guess the question is what's more valuable to you: wasting an hour manually looking for possible false positives, or wasting an hour getting ripped a new one by your boss because your über-clever regexp missed some crazy convoluted corner case in the Java Language Specification.
Jörg W Mittag
I came from a mechanical engineering background, so I am aware that mistakes will occur ... I am trying to choose the path that will yield fewer mistakes and better, reproducible results. A computer can do repetitive tasks without problem, humans on the other hand ... That is why computers exist. I can always tweak my regular expression, and it only takes a minute to run; manually evaluating this, however, can take days or weeks for the amount of content I'd have to go through, and after a few hours or a day I'm sure I would skip an occurrence or two here and there.