Hello,

I have to find all initialisms (capital-letter words such as SAP, JSON or XML) in my plain text files. Is there a ready-made script for this? Ruby, Python, Perl - the language doesn't matter. So far, I've found nothing.

Regards,

Stefan

+2  A: 

A regular expression like /[A-Z]{2,}/ should do the trick.

Brian Rasmussen
That also matches strings like fooBARbaz. And it doesn't support international character sets.
Juha Syrjälä
Correct, but the OP said "plain text", so I figured this would be good enough. If the text does contain words like that, you need to enclose the pattern in \b word-boundary anchors, and if an international character set is used, additional characters must be added to the character class.
Brian Rasmussen
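To make the difference discussed in these comments concrete, here is a minimal Python sketch. The sample sentence is made up for illustration; it compares the bare pattern with the same pattern wrapped in \b anchors.

```python
import re

# Hypothetical sample text, chosen to include a mixed-case word.
text = "SAP exports fooBARbaz as JSON or XML"

# Unanchored: also picks up the capitals embedded in fooBARbaz.
loose = re.findall(r"[A-Z]{2,}", text)

# Anchored with \b: only whole words made entirely of capitals.
strict = re.findall(r"\b[A-Z]{2,}\b", text)

print(loose)   # ['SAP', 'BAR', 'JSON', 'XML']
print(strict)  # ['SAP', 'JSON', 'XML']
```

Note that [A-Z] still covers ASCII capitals only, so the point about international character sets applies to both variants.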
+17  A: 

Here you go:

perl -e 'for (<>) { while (m/\b([[:upper:]]{2,})\b/g) { print "$1\n"; } }' textinput.txt

Grabs all all-uppercase words that are at least two characters long. I use [[:upper:]] instead of [A-Z] so that it works in any locale.

Conspicuous Compiler
Thanks, works like a charm.
Stefan
+1 for considering locale
Hobo
+4  A: 

A simpler version of Conspicuous Compiler's answer uses the -p flag to cut out all that ugly loop code:

perl -p -e 'm/\b([[:upper:]]{2,})\b/' input.txt
ire_and_curses
Two problems with this: (1) With -p it prints out the entire line, whether or not the line matches. (2) It only matches once per line, so it won't get multiple abbreviations on one line. You'll need at least one loop.
Conspicuous Compiler
Here's a variant which prints out only the abbreviation, but only gets the last abbreviation per line: perl -p -e 's/.*\b([[:upper:]]{2,})\b.*/\1/' textinput.txt
Conspicuous Compiler
@Conspicuous Compiler: Good points. I've asked the OP for a bit of clarification. I'll delete this answer if it's not sufficient.
ire_and_curses
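The "only matches once per line" problem raised in this thread can be illustrated in Python as well. This is a rough sketch with a made-up input line: a single search stops at the first hit, while findall collects every hit on the line.

```python
import re

pattern = re.compile(r"\b[A-Z]{2,}\b")

# Hypothetical line containing three all-caps words.
line = "Convert the XML feed to JSON before loading it into SAP"

# A single search returns only the first match on the line.
first_only = pattern.search(line).group(0)

# findall returns every non-overlapping match on the line.
all_matches = pattern.findall(line)

print(first_only)   # XML
print(all_matches)  # ['XML', 'JSON', 'SAP']
```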
A: 

Here's a Python 2.x solution that allows for digits (see example). Update: Code now works for Python 3.1, 3.0 and 2.1 to 2.6 inclusive.

dos-prompt>type find_acronyms.py
import re

try:
    set
except NameError: 
    try:
        from sets import Set as set # Python 2.3
    except ImportError: 
        class set: # Python 2.2 and earlier
            # VERY minimal implementation
            def __init__(self):
                self.d = {}
            def add(self, element):
                self.d[element] = None
            def __str__(self):
                return 'set(%s)' % self.d.keys()

word_regex = re.compile(r"\w{2,}", re.LOCALE)
# min length is 2 characters

def accumulate_acronyms(a_set, an_iterable):
    # updates a_set in situ
    for line in an_iterable:
        for word in word_regex.findall(line):
            if word.isupper() and "_" not in word:
                a_set.add(word)

test_data = """
A BB CCC _DD EE_ a bb ccc k9 K9 A1
It's a CHARLIE FOXTROT, said MAJ Major Major USAAF RETD.
FBI CIA MI5 MI6 SDECE OGPU NKVD KGB FSB
BB CCC # duplicates
_ABC_DEF_GHI_ 123 666 # no acronyms here
"""

result = set()
accumulate_acronyms(result, test_data.splitlines())
print(result)


dos-prompt>\python26\python find_acronyms.py
set(['CIA', 'OGPU', 'BB', 'RETD', 'CHARLIE', 'FSB',
'NKVD', 'A1', 'SDECE', 'KGB', 'MI6', 'USAAF', 'K9', 'MAJ',
'MI5', 'FBI', 'CCC', 'FOXTROT'])
# Above output has had newlines inserted for ease of reading.
# Output from 3.0 & 3.1 differs slightly in presentation.
# Output from 2.1 differs in item order.
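
For readers on a current Python 3, here is a hedged sketch of the same approach, not part of the tested answer above. Since Python 3.6, re.LOCALE is rejected for str patterns, and \w is already Unicode-aware for str, so the flag is dropped. The name find_acronyms and the sample string are made up for illustration.

```python
import re

# \w is Unicode-aware for str patterns in Python 3, so no flag is needed.
WORD_REGEX = re.compile(r"\w{2,}")  # minimum length is 2 characters


def find_acronyms(lines):
    """Return the set of all-uppercase words (digits allowed, no underscores)."""
    found = set()
    for line in lines:
        for word in WORD_REGEX.findall(line):
            # isupper() is False for digit-only words such as "123".
            if word.isupper() and "_" not in word:
                found.add(word)
    return found


sample = "FBI CIA MI5 _ABC_ k9 K9 123 Major BB BB"
print(sorted(find_acronyms(sample.splitlines())))
# ['BB', 'CIA', 'FBI', 'K9', 'MI5']
```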
John Machin
Semantically, there is the general class of "shortened words", abbreviations which include initialisms (formed of the initial letters of a series of words) and acronyms (a pronounceable abbreviation that may or may not be an initialism). Initialisms are almost always in all caps. Other types of abbreviations may or may not be. There's not a succinct word for "all caps words" that I know of...
Conspicuous Compiler
'\b' is worth using
Alexandr Ciornii
@Alexandr: Please supply a concrete example of where/how `'\b'` would be worth using.
John Machin
@JohnMachin: You rely on the set class existing as built in, which it isn't until Python 2.4. This is complicated by the need for parentheses on a call to print in Python 3.0 -- a somewhat narrow version window in which this works. -- However, you're right about not needing \b since you do an isupper() test.
Conspicuous Compiler
The `set` is there merely as a reminder that one might not want to churn out duplicate results (as some other solutions do). 2.4 to 2.6 is NOT a narrow window. In any case, users of 2.3 know what to do to get `set`, and 0 <= count(persons writing new code for earlier versions) <= epsilon. Users of 3.x know to mentally insert `(` and `)` when they see `print x`. For light entertainment, I edited my answer to support 2.1 to 3.1 and tested it. It will probably work on 2.0 but I wasn't about to download it. What versions of perl do your answers work on?
John Machin
My Perl sample, from what I have available to test, works from 5.4.x through 5.10.x which I guess only covers 12 years of Perl. My bad. -- ANYhow, thank you for the update. I honestly only had systems with 3.x and 2.3 to test with, so I wasn't just nitpicking. I had to learn Python to figure out what was wrong, fix it, and test it to see if Alexandr had a valid complaint.
Conspicuous Compiler
Also, since we're dealing with one-liners, I presumed the results would have been given a quick |sort|uniq if that was what was desired. Unix-style thinking.
Conspicuous Compiler
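For anyone not on a Unix shell, the |sort|uniq step can be approximated inside a script. This is a minimal Python sketch on made-up input: sorted(set(...)) plays the role of sort | uniq, and Counter gives roughly what sort | uniq -c would.

```python
import re
from collections import Counter

# Hypothetical input with repeated all-caps words.
text = "XML and JSON\nJSON over SAP\nXML again"
words = re.findall(r"\b[A-Z]{2,}\b", text)

unique_sorted = sorted(set(words))  # roughly: | sort | uniq
counts = Counter(words)             # roughly: | sort | uniq -c

print(unique_sorted)  # ['JSON', 'SAP', 'XML']
for word in unique_sorted:
    print(counts[word], word)
```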