ansaurus

Question

How can I find all initializations in a text?

Answer 1

+2 A:

A regular expression like /[A-Z]{2,}/ should do the trick.

Brian Rasmussen 2009-08-08 10:46:39

That also matches strings like fooBARbaz, and it doesn't support international character sets.

Juha Syrjälä 2009-08-08 11:34:00

Correct, but the OP said "plain text", so I figured this would be good enough. If the text does contain words like that, you need to enclose the pattern in \b word-boundary anchors, and if an international character set is used, additional characters must be added to the character class.

Brian Rasmussen 2009-08-08 21:17:51
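
The difference between the bare pattern and the \b-anchored one discussed in these comments can be checked with a short Python sketch. The sample string and variable names are made up for illustration; only the two patterns come from the thread.

```python
import re

# Hypothetical sample text containing a mixed-case word.
sample = "The FBI found fooBARbaz near NASA."

# Bare pattern: any run of two or more ASCII capitals, even inside a word.
bare = re.findall(r"[A-Z]{2,}", sample)

# Anchored pattern: \b demands a word boundary on both sides of the run.
anchored = re.findall(r"\b[A-Z]{2,}\b", sample)

print(bare)      # ['FBI', 'BAR', 'NASA']
print(anchored)  # ['FBI', 'NASA']
```

Neither pattern handles non-ASCII capitals; that needs a wider character class or a separate uppercase test.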

Answer 2

+17 A:

Here you go:

perl -e 'for (<>) { while (m/\b([[:upper:]]{2,})\b/g) { print "$1\n"; } }' textinput.txt

Grabs all all-uppercase words that are at least two characters long. I use [[:upper:]] instead of [A-Z] so that it works in any locale.

Conspicuous Compiler 2009-08-08 10:47:43

Thanks, works like a charm.

Stefan 2009-08-08 10:57:00

+1 for considering locale

Hobo 2009-08-08 11:11:06
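
For comparison, here is a rough Python 3 sketch of the same idea: collect words of at least two characters in which every character is an uppercase letter, without hard-coding A-Z. The function name `uppercase_words` and the sample sentence are invented for illustration, and it relies on Python's Unicode-aware `\w` and `str.isupper()` rather than on POSIX locale classes, so it is an analogue of the Perl one-liner and not a translation of it.

```python
import re

def uppercase_words(text, min_len=2):
    """Return words of at least min_len characters that are entirely uppercase letters."""
    # On a str pattern in Python 3, \w matches Unicode word characters,
    # so accented capitals stay inside the same word.
    words = re.findall(r"\w+", text)
    # Digits and underscores are not uppercase letters, so words that
    # contain them are rejected, much as [[:upper:]] would reject them.
    return [w for w in words
            if len(w) >= min_len and all(ch.isupper() for ch in w)]

print(uppercase_words("The ÄÖÜ office called NASA about fooBAR and K9."))
# ['ÄÖÜ', 'NASA']
```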

Answer 3

+4 A:

A simpler version of Conspicuous Compiler's answer uses the -p flag to cut out all that ugly loop code:

perl -p -e 'm/\b([[:upper:]]{2,})\b/' input.txt

ire_and_curses 2009-08-08 11:02:22

Two problems with this: (1) It prints out the entire line when it matches. (2) It only matches once per line, so it won't get multiple abbreviations on one line. You'll need at least one loop.

Conspicuous Compiler 2009-08-08 11:08:03

Here's a variant which prints out only the abbreviation, but only gets the last abbreviation per line: perl -p -e 's/.*\b([[:upper:]]{2,})\b.*/$1/' textinput.txt

Conspicuous Compiler 2009-08-08 11:25:55

@Conspicuous Compiler: Good points. I've asked the OP for a bit of clarification. I'll delete this answer if it's not sufficient.

ire_and_curses 2009-08-08 11:26:34
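
The three behaviours discussed in these comments (first match only, last match only, every match) can be seen side by side in a small Python sketch. The sample line is made up, and the ASCII class [A-Z] stands in for [[:upper:]] to keep the example short.

```python
import re

pattern = re.compile(r"\b([A-Z]{2,})\b")
line = "The FBI and the CIA briefed NASA."  # hypothetical input line

# A single search stops at the first match on the line.
first = pattern.search(line).group(1)

# A greedy .* in front of the group pushes the match as far right as it
# can go, so the substitution keeps only the last abbreviation.
last = re.sub(r".*\b([A-Z]{2,})\b.*", r"\1", line)

# findall walks the whole line and returns every match.
every = pattern.findall(line)

print(first)  # FBI
print(last)   # NASA
print(every)  # ['FBI', 'CIA', 'NASA']
```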

Answer 4

A:

Here's a Python 2.x solution that allows for digits (see example). Update: Code now works for Python 3.1, 3.0 and 2.1 to 2.6 inclusive.

dos-prompt>type find_acronyms.py
import re

try:
    set
except NameError: 
    try:
        from sets import Set as set # Python 2.3
    except ImportError: 
        class set: # Python 2.2 and earlier
            # VERY minimal implementation
            def __init__(self):
                self.d = {}
            def add(self, element):
                self.d[element] = None
            def __str__(self):
                return 'set(%s)' % self.d.keys()

word_regex = re.compile(r"\w{2,}", re.LOCALE)
# min length is 2 characters

def accumulate_acronyms(a_set, an_iterable):
    # updates a_set in situ
    for line in an_iterable:
        for word in word_regex.findall(line):
            if word.isupper() and "_" not in word:
                a_set.add(word)

test_data = """
A BB CCC _DD EE_ a bb ccc k9 K9 A1
It's a CHARLIE FOXTROT, said MAJ Major Major USAAF RETD.
FBI CIA MI5 MI6 SDECE OGPU NKVD KGB FSB
BB CCC # duplicates
_ABC_DEF_GHI_ 123 666 # no acronyms here
"""

result = set()
accumulate_acronyms(result, test_data.splitlines())
print(result)


dos-prompt>\python26\python find_acronyms.py
set(['CIA', 'OGPU', 'BB', 'RETD', 'CHARLIE', 'FSB',
'NKVD', 'A1', 'SDECE', 'KGB', 'MI6', 'USAAF', 'K9', 'MAJ',
'MI5', 'FBI', 'CCC', 'FOXTROT'])
# Above output has had newlines inserted for ease of reading.
# Output from 3.0 & 3.1 differs slightly in presentation.
# Output from 2.1 differs in item order.

John Machin 2009-08-08 14:46:16

Semantically, there is the general class of "shortened words", abbreviations which include initialisms (formed of the initial letters of a series of words) and acronyms (a pronounceable abbreviation that may or may not be an initialism). Initialisms are almost always in all caps. Other types of abbreviations may or may not be. There's not a succinct word for "all caps words" that I know of...

Conspicuous Compiler 2009-08-08 17:59:33

'\b' is worth using

Alexandr Ciornii 2009-08-08 21:46:18

@Alexandr: Please supply a concrete example of where/how `'\b'` would be worth using.

John Machin 2009-08-08 22:32:32

@JohnMachin: You rely on the set class existing as built in, which it isn't until Python 2.4. This is complicated by the need for parentheses on a call to print in Python 3.0 -- a somewhat narrow version window in which this works. -- However, you're right about not needing \b since you do an isupper() test.

Conspicuous Compiler 2009-08-10 03:35:55

The `set` is there merely as a reminder that one might not want to churn out duplicate results (as some other solutions do). 2.4 to 2.6 is NOT a narrow window. In any case, users of 2.3 know what to do to get `set`, and 0 <= count(persons writing new code for earlier versions) <= epsilon. Users of 3.x know to mentally insert `(` and `)` when they see `print x`. For light entertainment, I edited my answer to support 2.1 to 3.1 and tested it. It will probably work on 2.0 but I wasn't about to download it. What versions of perl do your answers work on?

John Machin 2009-08-10 12:57:38

My Perl sample, from what I have available to test, works from 5.4.x through 5.10.x which I guess only covers 12 years of Perl. My bad. -- ANYhow, thank you for the update. I honestly only had systems with 3.x and 2.3 to test with, so I wasn't just nitpicking. I had to learn Python to figure out what was wrong, fix it, and test it to see if Alexandr had a valid complaint.

Conspicuous Compiler 2009-08-11 12:17:38

Also, since we're dealing with one-liners, I presumed the results would have been given a quick |sort|uniq if that was what was desired. Unix-style thinking.

Conspicuous Compiler 2009-08-11 12:20:04
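
The sort-and-deduplicate step mentioned above can also be done inside a script instead of in a shell pipeline. A minimal Python sketch, with a hypothetical helper name and made-up input lines:

```python
import re

def unique_sorted_abbreviations(lines):
    """Collect abbreviations from an iterable of lines, deduplicated and sorted."""
    found = set()
    for line in lines:
        # A set silently drops repeats, playing the role of uniq.
        found.update(re.findall(r"\b[A-Z]{2,}\b", line))
    # sorted() returns a list in ascending order, playing the role of sort.
    return sorted(found)

print(unique_sorted_abbreviations(["FBI CIA", "KGB FBI", "no match here"]))
# ['CIA', 'FBI', 'KGB']
```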
