Apologies if this was answered elsewhere -- I did some searching and couldn't find the answer.

Suppose I have a text file that contains a bunch of content. In that content is an occupation code, which is always in the format of a number followed by a capital letter.

How can I extract ONLY the occ codes from the file? In plain English, I want to remove everything in the file that does not match the number-capital_letter pattern.

+5  A: 

You could match using /(\d+[A-Z])/
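
For illustration, here is a minimal Python sketch of that match-and-collect idea. The sample text and the name `OCC_CODE` are made up for the example. Note that the pattern also matches inside longer tokens (for example `12C` within `AB12C`), so add word boundaries if that matters for your data.

```python
import re

# One or more digits followed by a single capital letter.
OCC_CODE = re.compile(r"\d+[A-Z]")

# Hypothetical sample content; a real file would be read from disk.
text = "Clerk 43B, analyst 1120C; see also 7Z."

# findall returns every non-overlapping match, in order of appearance.
codes = OCC_CODE.findall(text)
print(codes)  # ['43B', '1120C', '7Z']
```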

gnarf
Indeed. Rather than removing everything else, in this case it's simpler just to match what you want and spit that out.
Kevin Ballard
A: 

A simple solution is to write a script that scans the file line by line or word by word (depending on how the occ codes appear in it), checks for matches, possibly using a regex, and then writes them to another file.

You COULD use a single regex match on the entire document and iterate over the results, but that means reading the whole file into memory, which could pose problems depending on the size of the file.
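
A rough sketch of the line-by-line variant in Python, assuming the codes never span a line break. The function name `extract_codes` is hypothetical, and `io.StringIO` stands in for a real file so the example runs on its own.

```python
import io
import re

# One or more digits followed by a single capital letter.
OCC_CODE = re.compile(r"[0-9]+[A-Z]")

def extract_codes(lines):
    """Yield every code found, processing one line at a time."""
    for line in lines:
        # Only the current line is held in memory, not the whole file.
        yield from OCC_CODE.findall(line)

# Stand-in for an open file object; any iterable of lines works.
source = io.StringIO("intro text\nworker 12A and 305K\nnothing here\n9Q\n")
result = list(extract_codes(source))
print(result)  # ['12A', '305K', '9Q']
```

With a real file you would pass the object returned by `open(...)` instead of the `StringIO`, and write each yielded code to the output file as it arrives.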

Derek Litz
A: 

Here's a crude attempt to remove everything except the desired codes using sed. (Note that I interpret "number" to mean a string of one or more digits, no decimal point or leading minus sign.)

sed -e 's/\([A-Z]\)[^0-9]*/\1/g' -e 's/[0-9]*[^0-9A-Z][^0-9A-Z]*//g' -e 's/[0-9]*$//' -e '/^$/d' < filename

The first command removes anything after a capital letter that isn't a number (and therefore perhaps the beginning of another code), the second removes any number followed by something other than a capital letter, the third removes trailing numbers and the fourth removes blank lines.
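
To see what each step does to a line, here is a hedged Python transcription that follows the prose description of the steps rather than any particular sed dialect. The function name `strip_to_codes` is invented for this sketch, and it expects a line without its trailing newline.

```python
import re

def strip_to_codes(line):
    """Apply the three substitution steps to one line (no trailing newline)."""
    # Step 1: after a capital letter, drop everything that is not a digit.
    line = re.sub(r"([A-Z])[^0-9]*", r"\1", line)
    # Step 2: drop a (possibly empty) number followed by at least one
    # character that is neither a digit nor a capital letter.
    line = re.sub(r"[0-9]*[^0-9A-Z]+", "", line)
    # Step 3: drop digits left dangling at the end of the line.
    line = re.sub(r"[0-9]*$", "", line)
    return line

sample = ["x 12A y 34B z 56", "no codes here"]
# Step 4: discard lines that end up empty.
kept = [out for out in map(strip_to_codes, sample) if out]
print(kept)  # ['12A34B']
```

Two things to be aware of with this approach: codes from the same line come out run together (`12A34B`), and a capital letter that is not part of a code survives, so `strip_to_codes("Hello 12A")` gives `H12A`.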

I've run some tests and this seems to work pretty well. I'll happily amend it if anyone can find a case where it fails.

Beta