ansaurus

Question

Answer 1

+3 A:

>>> import re
>>>
>>> x="""Hello user123,
...
... - (604)7080900
... - 152
... - minutes
...
... Regards
... """
>>>
>>> re.findall(r"\n+\n-\s*(.*)\n-\s*(.*)\n-\s*(minutes)\s*\n\n+",x)
[('(604)7080900', '152', 'minutes')]
>>>

S.Mark 2010-04-27 09:29:59

@S.Mark, sorry that I didn't make the question clear. Please see the edit about an undefined number of rows between the two blank lines.

ohho 2010-04-27 09:34:53

@Horace, added \n+ to match more than 2 blank lines

S.Mark 2010-04-27 09:52:38

@S.Mark, is it possible to take (minutes) out of the regex? "minutes" does not necessarily show up in the last row.

ohho 2010-04-27 10:11:40

@Horace, yeah, you could change `(minutes)` to `.*` if you don't want it in the result.

S.Mark 2010-04-27 10:14:50
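
The case raised in the comments above (an undefined number of "- " rows between two blank lines, with no fixed "minutes" row) can be sketched in two steps: first match the whole run of "- " lines, then split it into rows. This is only a rough sketch, not S.Mark's pattern. The names `BLOCK_RE` and `find_blocks` are made up for illustration, and it assumes "\n" line endings, truly empty blank lines, and a blank line after every block.

```python
import re

# Hypothetical names, for illustration only.
# A run of "- " lines that sits directly after a blank line and directly
# before another blank line. Assumes "\n" line endings and blank lines
# that contain no spaces.
BLOCK_RE = re.compile(r"(?<=\n\n)((?:- .*\n)+)(?=\n)")


def find_blocks(text):
    """Return one list of row values per block of '- ' lines."""
    blocks = []
    for block in BLOCK_RE.findall(text):
        # Drop the '- ' prefix and trailing whitespace from each row.
        blocks.append([line[2:].rstrip() for line in block.splitlines()])
    return blocks


sample = """Hello user123,

- (604)7080900
- 152
- minutes

Regards
"""
print(find_blocks(sample))
# [['(604)7080900', '152', 'minutes']]
```

The lookbehind and lookahead leave the blank lines unconsumed, so two blocks separated by a single blank line are both found.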

Answer 2

+1 A:

That's a really basic question. I could code it for you, sure, but I think it would be better to learn the Python basics and do it yourself.

If you just use a pre-made answer here, you'll encounter similar problems pretty much at every step.

http://diveintopython3.org/

Please do not feel offended :)

Lo'oris 2010-04-27 09:30:09

Thank you for the great advice, but I am really in a hurry ;-)

ohho 2010-04-27 09:31:27

Answer 3

+2 A:

The simplest approach is to go over the lines (assuming you have a list of lines, a file, or a string that you split into a list of lines) until you see an empty line (just '\n' when reading from a file). Then check that each following line starts with '- ' (using the startswith string method), slice that prefix off and store the result, until you find another empty line. For example:

# if you have a single string, split it into lines.
L = s.splitlines()
# if you (now) have a list of lines, grab an iterator so we can continue
# iteration where it left off.
it = iter(L)
# Alternatively, if you have a file, just use that directly:
# it = open(....)

# The stripped data lines end up here.
data = []
# Find the first empty line:
for line in it:
    # Treat lines of just whitespace as empty lines too. If you don't want
    # that, do 'if line == ""'.
    if not line.strip():
        break
# Now starts data.
for line in it:
    if not line.rstrip():
        # End of data.
        break
    if line.startswith('- '):
        # Slice off the '- ' prefix and any trailing whitespace.
        data.append(line[2:].rstrip())
    else:
        # misformed data?
        raise ValueError("misformed line %r" % (line,))

Edited: Since you elaborated on what you want to do, here's an updated version of the loops. It no longer loops twice, but instead collects data until it encounters a 'bad' line, and either saves or discards the collected lines when it encounters a block separator. It doesn't need an explicit iterator, because it doesn't restart iteration, so you can just pass it a list (or any iterable) of lines:

def getblocks(L):
    # The list of good blocks (as lists of lines.) You can also make this
    # a flat list if you prefer.
    data = []
    # The list of good lines encountered in the current block
    # (but the block may still become bad.)
    block = []
    # Whether the current block is bad.
    bad = True
    for line in L:
        # Not in a 'good' block, and encountering the block separator.
        if bad and not line.rstrip():
            bad = False
            block = []
            continue
        # In a 'good' block and encountering the block separator.
        if not bad and not line.rstrip():
            # Save 'good' data. Or, if you want a flat list of lines,
            # use 'extend' instead of 'append' (also below.)
            data.append(block)
            block = []
            continue
        if not bad and line.startswith('- '):
            # A good line in a 'good' (not 'bad' yet) block; save the line,
            # minus the '- ' prefix and trailing whitespace.
            block.append(line[2:].rstrip())
            continue
        else:
            # A 'bad' line, invalidating the current block.
            bad = True
    # Don't forget to handle the last block, if it's good
    # (and if you want to handle the last block.)
    if not bad and block:
        data.append(block)
    return data

And here it is in action:

>>> L = """hello
...
... - x1
... - x2
... - x3
...
... - x4
...
... - x6
... morning
... - x7
...
... world""".splitlines()
>>> print(getblocks(L))
[['x1', 'x2', 'x3'], ['x4']]

Thomas Wouters 2010-04-27 09:32:35

@Thomas Wouters, "for line" is not reliable (otherwise I wouldn't have tagged this question with multiline ;-) I can only start the matching _after_ "\n\n- " (two linefeeds, then a leading minus sign and a space).

ohho 2010-04-27 09:38:40

That wasn't (and still isn't) in your question, but the basic approach remains the same. You can still use iteration over lines just fine, but you'll have to clarify what you actually have and actually want if you want me to write down an example. What if there are lines that don't start with "- " in between lines that do? What if there are multiple such blocks? What if the lines aren't empty but just have some whitespace?

Thomas Wouters 2010-04-27 09:49:48

please see my 2nd edit..

ohho 2010-04-27 10:13:37

I'm still not sure how my current answer doesn't work for you. (I don't see a second edit?)

Thomas Wouters 2010-04-27 10:19:50

@Thomas Wouters updated, "a continuous of good lines between 2 empty lines" section, thx

ohho 2010-04-27 10:31:50

Answer 4

+1 A:

>>> s = """Hello user123,

- (604)7080900
- 152
- minutes

Regards
"""
>>> import re
>>> re.findall(r'^- (.*)', s, re.M)
['(604)7080900', '152', 'minutes']

SilentGhost 2010-04-27 09:47:59

Answer 5

+1 A:

s = """Hello user123,

- (604)7080900
- 152
- minutes

Regards  

Hello user124,

- (604)8576576
- 345
- minutes
- seconds
- bla

Regards"""

do this:

result = []
for data in s.split('Regards'): 
    result.append([v.strip() for v in data.split('-')[1:]])
del result[-1] # remove empty list at end

and have this:

>>> result
[['(604)7080900', '152', 'minutes'],
['(604)8576576', '345', 'minutes', 'seconds', 'bla']]

remosu 2010-04-27 10:20:41
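
One caveat with splitting on '-': a value that itself contains a hyphen would be cut into pieces. A possible variant, sketched below, splits each chunk into lines and keeps only those that start with '- '. The name `split_records`, its `terminator` parameter, and the hyphenated phone number in the sample are all made up for illustration. It still assumes every record ends with "Regards".

```python
def split_records(text, terminator="Regards"):
    """Return one list of '- ' row values per record (hypothetical helper)."""
    records = []
    for chunk in text.split(terminator):
        # Keep only lines starting with '- ' and drop that prefix, so
        # hyphens inside a value are left alone.
        rows = [line[2:].strip() for line in chunk.splitlines()
                if line.startswith("- ")]
        # Skip chunks without any rows (such as the empty tail).
        if rows:
            records.append(rows)
    return records


# Made-up sample data; the first phone number contains hyphens on purpose.
sample = """Hello user123,

- 604-708-0900
- 152
- minutes

Regards

Hello user124,

- (604)8576576
- 345

Regards"""
print(split_records(sample))
# [['604-708-0900', '152', 'minutes'], ['(604)8576576', '345']]
```

Because empty chunks are skipped, there is no need to delete the trailing empty list afterwards.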


multi-line pattern matching in python
