I have a page that looks like this:

Company A<br />
123 Main St.<br />
Suite 101<br />
Someplace, NY 1234<br />
<br />
<br />
<br />
Company B<br />
456 Main St.<br />
Someplace, NY 1234<br />
<br />
<br />
<br />

Sometimes there are two rather than three "br" tags separating the entries. How would I use BeautifulSoup to parse through this document and extract the fields? I'm stumped because the bits of text that I need are not contained in paragraph (or similar) tags that I can simply iterate through.

+1  A: 

Once you have this HTML fragment, just use a regex to replace each <br /> (and any newline that follows it) with a single newline, then split on runs of multiple newlines. This should give you individual paragraphs which you can process manually.
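A minimal sketch of that approach (the exact regex and the fragment variable are illustrative, not taken from the answer itself):

import re

# `fragment` stands in for the HTML fragment already extracted with BeautifulSoup
fragment = """Company A<br />
123 Main St.<br />
Suite 101<br />
Someplace, NY 1234<br />
<br />
<br />
Company B<br />
456 Main St.<br />
Someplace, NY 1234<br />
"""

# replace each <br /> (plus any newline right after it) with a single newline,
# then split the result on runs of two or more newlines
text = re.sub(r"<br\s*/>\n?", "\n", fragment)
entries = [e.strip() for e in re.split(r"\n{2,}", text) if e.strip()]
for entry in entries:
    print(entry.splitlines())
# ['Company A', '123 Main St.', 'Suite 101', 'Someplace, NY 1234']
# ['Company B', '456 Main St.', 'Someplace, NY 1234']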

Ignacio Vazquez-Abrams
Thanks for the answer, but unfortunately it's not as simple as just using a regex. I've simplified the above document to better illustrate my question. The actual document has a jumble of HTML formatting tags and the like.
jamieb
But you don't *care* about the document, just the part separated by `<br />` tags. Use BeautifulSoup to extract that part first.
Ignacio Vazquez-Abrams
I'm not sure why someone downvoted your answer; I appreciate the help. I will try a couple of ideas based on your suggestion. I was just hoping that BeautifulSoup would have eliminated the need for manual parsing. Thank you.
jamieb
BeautifulSoup is good for the tags that deal with structure and style, but `<br />` doesn't fall into either of those.
Ignacio Vazquez-Abrams
While I probably would have preferred to work with Michal's answer, I didn't see it until after I completed my project. I was able to do what I needed using your suggestion. Thank you.
jamieb
A: 

You can do a little bit of manipulation first: e.g. change all newlines to blanks, then substitute two or more consecutive occurrences of <br /> with some other delimiter, such as |. After that you can extract your fields.

html="""
Company A<br />
123 Main St.<br />
Suite 101<br />
Someplace, NY 1234<br />
<br />
<br />
<br />
Company B<br />
456 Main St.<br />
Someplace, NY 1234<br />
<br />
<br />
<br />
"""
import re

# collapse the newlines so consecutive <br /> tags become adjacent
newhtml = html.replace("\n", "")
# two or more consecutive <br /> tags mark the boundary between entries
pat = re.compile(r"(<br />){2,}")
print(pat.sub("|", newhtml))

output

$ ./python.py
Company A<br />123 Main St.<br />Suite 101<br />Someplace, NY 1234|Company B<br />456 Main St.<br />Someplace, NY 1234|

Now each company's information is separated by pipes.
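From there (just as an illustration, not part of the original answer), each pipe-delimited entry can be split again on the single <br /> tags to recover the individual fields:

entries = [e.split("<br />") for e in pat.sub("|", newhtml).split("|") if e]
print(entries)
# [['Company A', '123 Main St.', 'Suite 101', 'Someplace, NY 1234'],
#  ['Company B', '456 Main St.', 'Someplace, NY 1234']]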

ghostdog74
A: 

Perhaps you could use this function:

def partition_by(pred, iterable):
    """Group consecutive items of iterable into chunks on which pred is
    uniformly True or uniformly False, yielding each chunk as a list."""
    current = None
    current_flag = None
    chunk = []
    for item in iterable:
        if current is None:
            current = item
            current_flag = pred(current)
            chunk = [current]
        elif pred(item) == current_flag:
            chunk.append(item)
        else:
            yield chunk
            current = item
            current_flag = not current_flag
            chunk = [current]
    if len(chunk) > 0:
        yield chunk
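For example (a quick illustration of the function's behaviour, not part of the original answer), partitioning a list of numbers by parity groups consecutive runs together:

print(list(partition_by(lambda x: x % 2 == 0, [2, 4, 1, 3, 6])))
# [[2, 4], [1, 3], [6]]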

Add something to check for being a <br /> tag or newline:

def is_br(bs):
    try:
        return bs.name == u'br'
    except AttributeError:
        return False

def is_br_or_nl(bs):
    return is_br(bs) or u'\n' == bs

(Or whatever else is more appropriate... I'm not that good with BeautifulSoup.)

Then, with cs set to BeautifulSoup.BeautifulSoup(your_example_html).childGenerator(), partition_by(is_br_or_nl, cs) yields:

[[u'Company A'],
 [<br />],
 [u'\n123 Main St.'],
 [<br />],
 [u'\nSuite 101'],
 [<br />],
 [u'\nSomeplace, NY 1234'],
 [<br />, u'\n', <br />, u'\n', <br />, u'\n', <br />],
 [u'\nCompany B'],
 [<br />],
 [u'\n456 Main St.'],
 [<br />],
 [u'\nSomeplace, NY 1234'],
 [<br />, u'\n', <br />, u'\n', <br />, u'\n', <br />]]

That should be easy enough to process.

To generalise this, you'd probably have to write a predicate that checks whether its argument is something you care about, then use it with partition_by so that everything else gets lumped together. Note that the things you care about are lumped together as well -- you basically have to process every item of every second list produced by the resulting generator, starting with the first one that contains things you care about.
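As a rough sketch of that post-processing (the helper below is my own illustration, not part of the answer, and it assumes the old BeautifulSoup 3 API used above): collect the text chunks into the current entry and start a new entry whenever a separator chunk contains more than one <br />.

def extract_entries(chunks):
    # Illustrative helper: rebuild one list of address lines per company.
    entries = [[]]
    for chunk in chunks:
        if is_br_or_nl(chunk[0]):
            # a separator chunk with two or more <br /> tags ends the entry
            if sum(1 for item in chunk if is_br(item)) > 1 and entries[-1]:
                entries.append([])
        else:
            entries[-1].extend(item.strip() for item in chunk)
    return [e for e in entries if e]

soup = BeautifulSoup.BeautifulSoup(your_example_html)
print(extract_entries(partition_by(is_br_or_nl, soup.childGenerator())))
# [[u'Company A', u'123 Main St.', u'Suite 101', u'Someplace, NY 1234'],
#  [u'Company B', u'456 Main St.', u'Someplace, NY 1234']]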

Michał Marczyk