ansaurus

Question

Answer 1

+5 A:

I would use regular expressions. This answer assumes that tags are never nested, that is, none of the tag characters {}[] appear inside another tag.

import re
text = 'this is a bunch of text that is not {[really]} useful in any {[way]}. I need to {[get]} some items {[from]} it.'

for s in re.findall(r'\{\[(.*?)\]\}', text):
    print(s)

The same pattern, written using the verbose mode of Python regular expressions:

re.findall(r'''
    \{   # opening curly brace
    \[   # followed by an opening square bracket
    (    # capture the next pattern
    .*?  # followed by shortest possible sequence of anything
    )    # end of capture
    \]   # followed by closing square bracket
    \}   # followed by a closing curly brace
    ''', text, re.VERBOSE)

Bryan Oakley 2010-06-14 19:11:49
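
Since the title mentions a text file, here is a minimal sketch of applying the same non-greedy pattern line by line. The names TAG_RE, extract_tags, and the file name 'input.txt' are hypothetical, not from the original answer, and io.StringIO stands in for a real file so the example runs on its own.

```python
import io
import re

# Non-greedy match of anything between the literal delimiters {[ and ]}.
TAG_RE = re.compile(r'\{\[(.*?)\]\}')

def extract_tags(lines):
    """Collect the text of every {[...]} tag from an iterable of lines."""
    items = []
    for line in lines:
        items.extend(TAG_RE.findall(line))
    return items

# io.StringIO stands in for a real file here; with a file on disk you
# would pass the object returned by open('input.txt') instead
# ('input.txt' is a hypothetical name).
sample = io.StringIO(
    'this is a bunch of text that is not {[really]} useful\n'
    'in any {[way]}. I need to {[get]} some items {[from]} it.\n'
)
items = extract_tags(sample)
print(items)
```

Because this reads one line at a time, a tag that is split across a line break would not be matched.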

Answer 2

+3 A:

This is a job for regex:

>>> import re
>>> text = 'this is a bunch of text that is not {[really]} useful in any {[way]}. I need to {[get]} some items {[from]} it.'
>>> re.findall(r'\{\[(\w+)\]\}', text)
['really', 'way', 'get', 'from']

Daniel Roseman 2010-06-14 19:12:48

Wow, that was fast.. and perfect. Thanks!

chris 2010-06-14 19:15:48

@chris: be careful with this: \w+ only captures word characters (letters, digits, and underscores) between the delimiters. If your data has other sorts of characters, this won't pick them up.

Bryan Oakley 2010-06-14 19:22:25

To expand on Bryan's comment, specific cases that \w+ will miss include: hyphenated words, {[anti-war]}; compound words with whitespace, {[New England]}; and names of places or people that use punctuation and whitespace, {[Boston, MA]}, {[George W. Bush]}.

tgray 2010-06-14 20:59:27
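
The difference described in these comments can be checked directly. This is a small sketch comparing the two patterns from the answers above on the cases tgray lists; the sample string and the variable names word_only and anything are made up for illustration.

```python
import re

# Hypothetical sample containing one plain tag plus the problem cases.
text = ('Tags: {[way]}, {[anti-war]}, {[New England]}, '
        '{[Boston, MA]}, {[George W. Bush]}.')

# \w+ requires the whole tag body to consist of word characters.
word_only = re.findall(r'\{\[(\w+)\]\}', text)

# .*? accepts any characters up to the first closing ]}.
anything = re.findall(r'\{\[(.*?)\]\}', text)

print(word_only)
print(anything)
```

Only the plain tag survives the \w+ pattern, while the non-greedy pattern returns all five tag bodies.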

Answer 3

+1 A:

slower, bigger, no regular expressions

the old school way :P

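def f(s):
    result = []
    stack = []  # currently open delimiters
    tmp = ''    # text collected inside the current [...]
    for c in s:
        if c in '{[':
            stack.append(c)
        elif c in ']}':
            stack.pop()
            if c == ']':
                # closing bracket: the collected item is complete
                result.append(tmp)
                tmp = ''
        elif stack and stack[-1] == '[':
            # only keep characters while inside a square bracket
            tmp += c
    return result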

>>> s
'this is a bunch of text that is not {[really]} useful in any {[way]}. I need to {[get]} some items {[from]} it.'
>>> f(s)
['really', 'way', 'get', 'from']

remosu 2010-06-15 08:18:07

Answer 4

A:

Another way

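def between_strings(source, start='{[', end=']}'):
    words = []
    while True:
        start_index = source.find(start)
        if start_index == -1:
            break
        # Look for the closing delimiter only after the opening one.
        end_index = source.find(end, start_index + len(start))
        if end_index == -1:
            break  # unterminated tag: stop instead of slicing with -1
        words.append(source[start_index+len(start):end_index])
        # Continue scanning after the closing delimiter.
        source = source[end_index+len(end):]
    return words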


text = "this is a bunch of text that is not {[really]} useful in any {[way]}. I need to {[get]} some items {[from]} it."
assert between_strings(text) == ['really', 'way', 'get', 'from']

Henry 2010-06-22 03:39:31

Parse items from text file
