I need to extract certain abbreviations from a file, such as ABS, TVS, and PERL: any abbreviation written entirely in uppercase letters. I'd prefer to do this with a regular expression. Any help is appreciated.

+2  A: 

Untested:


my %abbr;
open (my $input, "<", "filename")
  || die "open: $!";
for ( < $input > ) {
  while (s/([A-Z][A-Z]+)//) {
    $abbr{$1}++;
  }
}

Modified it to look for at least two consecutive capital letters.

Marius Kjeldahl
No need to substitute there, nor to read in the whole file before processing any of it (though you've got a bug: `< $input >` is a glob(), not a readline(), because of the extra spaces).
ysth
You're probably right, but the editor didn't allow it without the spaces. I suspect the "lt dollar" sequence got cut out without the spaces.
Marius Kjeldahl
You need to tell the editor that you're in charge - or perhaps get a different editor.
Telemachus
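The glob-versus-readline point raised in the comments above can be checked with a small sketch. The helper names `lines_via_readline` and `items_via_glob` are made up for illustration, and the in-memory filehandle is only there so that no real file is needed. With the extra spaces, Perl treats the angle brackets as `glob(" $fh ")`, so the result is not the file's lines.

```perl
use strict;
use warnings;

# Hypothetical helper: reads lines with the readline form of <...>.
sub lines_via_readline {
    my ($content) = @_;
    open my $fh, '<', \$content or die "open: $!";   # in-memory filehandle
    my @lines = <$fh>;        # no spaces inside the brackets: readline
    chomp @lines;
    return @lines;
}

# Hypothetical helper: same handle, but with spaces inside the brackets.
sub items_via_glob {
    my ($content) = @_;
    open my $fh, '<', \$content or die "open: $!";
    my @items = < $fh >;      # spaces inside the brackets: glob(" $fh ")
    return @items;
}

my $data = "ABS\nTVS\nPERL\n";
print "readline: ", join(",", lines_via_readline($data)), "\n";
print "glob:     ", join(",", items_via_glob($data)), "\n";
```

The second line of output is whatever glob() makes of the stringified handle, not ABS, TVS, and PERL.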
+4  A: 

It would have been nice to hear what part you were particularly having trouble with.

my %abbr;
open my $inputfh, '<', 'filename'
    or die "open error: $!\n";
while ( my $line = readline($inputfh) ) {
    while ( $line =~ /\b([A-Z]{2,})\b/g ) {
        $abbr{$1}++;
    }
}

for my $abbr ( sort keys %abbr ) {
    print "Found $abbr $abbr{$abbr} time(s)\n";
}
ysth
+2  A: 
#!/usr/bin/perl

use strict;
use warnings;

my %abbrs = ();

while(<>){
    my @words = split ' ', $_;

    foreach my $word(@words){
        $word =~ /([A-Z]{2,})/ && $abbrs{$1}++;
    }
}

# %abbrs now contains all abbreviations
dsm
Missing a `$word=~` there. For kicks, you could say: `$word =~ y/A-Z//c or $abbrs{$word}++;`.
ysth
Well spotted, thanks.
dsm
I need to extract only abbreviations like ABC or BAV. For example, my document also contains strings like ABC123 and CMV002, and those get extracted too. In this case I want to extract only ABC and CMV. Can you help me?
lokesh
OK, I changed it so it does that.
dsm
Alternatively, if the numbers always come after the abbreviation, you can use `/^([A-Z]+)[0-9]*$/`.
dsm
I have a problem: `/^([A-Z]+)[0-9]*$/` also extracts digits at the start, say for ex017_ABC_EFG.
lokesh
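One possible way to handle strings such as ABC123 or ex017_ABC_EFG is to drop the anchors and use lookarounds instead of `\b`. Digits and underscores count as word characters, so `\b` does not see a boundary between `C` and `1` or between `_` and `A`. The sketch below is only an assumption about what is wanted (runs of two or more uppercase ASCII letters that do not touch another letter), and `extract_abbrevs` is a made-up name.

```perl
use strict;
use warnings;

# Hypothetical helper: returns every run of two or more uppercase ASCII
# letters that is not directly preceded or followed by another ASCII letter.
# Digits, underscores and punctuation next to the run are allowed.
# Call it in list context to get the matches.
sub extract_abbrevs {
    my ($text) = @_;
    return $text =~ /(?<![A-Za-z])([A-Z]{2,})(?![A-Za-z])/g;
}

my @found = extract_abbrevs("ABC123 CMV002 ex017_ABC_EFG Hello WOrld");
print join(" ", @found), "\n";   # prints "ABC CMV ABC EFG"
```

Mixed-case words such as "WOrld" are skipped because the uppercase run is followed by a lowercase letter.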
+3  A: 

Reading text to be searched from standard input and writing all abbreviations found to standard output, separated by spaces:

my $text;
# Slurp all text
{ local $/ = undef; $text = <>; }
# Extract all sequences of 2 or more uppercase characters
my @abbrevs = $text =~ /\b([[:upper:]]{2,})\b/g;
# Output separated by spaces
print join(" ", @abbrevs), "\n";

Note the use of the POSIX character class [:upper:], which will match all uppercase characters, not just English ones (A-Z).
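A small sketch of the difference between the two character classes, using made-up helper names (`ascii_abbrevs`, `unicode_abbrevs`) and a test string built from `\x{...}` escapes so the source file's encoding does not matter. It assumes the text is a character string. Input read from a file or from standard input would normally need to be decoded first, for instance with an `:encoding(UTF-8)` layer, before `[:upper:]` can recognise non-ASCII letters.

```perl
use strict;
use warnings;

# Hypothetical helpers: the same extraction with each character class.
sub ascii_abbrevs   { my ($text) = @_; return $text =~ /\b([A-Z]{2,})\b/g; }
sub unicode_abbrevs { my ($text) = @_; return $text =~ /\b([[:upper:]]{2,})\b/g; }

# "NATO" in Latin capitals, then four Greek capital letters (Nu, Alpha, Tau, Omicron).
my $text = "NATO and \x{39D}\x{391}\x{3A4}\x{39F} are abbreviations";

my @ascii   = ascii_abbrevs($text);
my @unicode = unicode_abbrevs($text);

# Print counts only, to avoid "Wide character" warnings on output.
printf "A-Z: %d match(es), [:upper:]: %d match(es)\n",
    scalar(@ascii), scalar(@unicode);   # prints "A-Z: 1 match(es), [:upper:]: 2 match(es)"
```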

Lars Haugseth
Put `\b` at the beginning and end.
Brad Gilbert
Good idea, I've updated my answer.
Lars Haugseth