ansaurus

Question

Answer 1

A:

Why would you write your own HTML parser? The standard library includes HTMLParser, and BeautifulSoup can handle any job HTMLParser can't.
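For what it's worth, here is a rough sketch of how the standard-library route might look for the "embed tag followed by a link" case. It assumes Python 3, where the module is named html.parser; the class name EmbedLinkFinder and the helper find_embed_links are made up for illustration and are not part of any library.

```python
from html.parser import HTMLParser


class EmbedLinkFinder(HTMLParser):
    """Pair each <embed> with the href of the next <a>, or None if there is none.

    Hypothetical helper class, written only for this illustration.
    """

    def __init__(self):
        super().__init__()
        self.pairs = []        # list of (embed attributes, href or None)
        self._pending = None   # attributes of an <embed> still waiting for an <a>

    def _flush(self):
        # Record a pending <embed> that never got a following <a>.
        if self._pending is not None:
            self.pairs.append((self._pending, None))
            self._pending = None

    def handle_starttag(self, tag, attrs):
        if tag == "embed":
            self._flush()
            self._pending = dict(attrs)
        elif tag == "a" and self._pending is not None:
            self.pairs.append((self._pending, dict(attrs).get("href")))
            self._pending = None

    def close(self):
        super().close()
        self._flush()


def find_embed_links(html):
    finder = EmbedLinkFinder()
    finder.feed(html)
    finder.close()
    return finder.pairs


sample = (
    '<object><embed src="one.swf"></embed></object><br />'
    '<a href="blah">blah</a><embed src="two.swf">'
)
embed_links = find_embed_links(sample)
```

Unlike a grammar that requires the tags to be adjacent, this pairs an embed with whatever link comes next, however far away, so a real page may need a stricter rule.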

Ned Batchelder 2009-11-20 01:02:12

http://pyparsing.wikispaces.com/

ʞɔıu 2009-11-20 01:08:42

I know what pyparsing is; I just wonder why you would use it for the messy job of parsing HTML when specialized modules already exist.

Ned Batchelder 2009-11-20 01:16:07

+1 for BeautifulSoup

John Keyes 2009-11-20 01:23:00

I updated the question with the reason why I don't use BeautifulSoup. Short answer: because BeautifulSoup gets lots of parse errors, but I don't have the same problem with pyparsing. If there's a better way to use BeautifulSoup that I don't know about or there's something else I'm missing I would be really interested in learning about that, however.

ʞɔıu 2009-11-20 15:14:43

Answer 2

+3 A:

If an optional <a> tag is of interest when it follows an <embed> tag, add it to your search pattern:

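import pyparsing

# makeHTMLTags returns an (opening tag, closing tag) pair; [0] is the opening tag
embedTag = pyparsing.makeHTMLTags("embed")[0]
aTag = pyparsing.makeHTMLTags("a")[0]
# Optional(aTag) only picks up an <a> that comes right after the <embed> tag
target = embedTag + pyparsing.Optional(aTag)
result = target.searchString(""".....
    <object....><embed>.....</embed></object><br /><a href="blah">blah</a>
    """)

print(result.dump())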

If you want to capture the character location of an expression within your parser, insert one of these, with a results name:

# An Empty() whose parse action returns the current parse location
loc = pyparsing.Empty().setParseAction(lambda s,locn,toks: locn)
target = (loc("beforeEmbed") + embedTag + loc("afterEmbed") +
          pyparsing.Optional(aTag))
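As a small sketch of what those captured locations are good for: they can be used to slice the matched tag back out of the original string. The sample HTML and the variable names below are invented for the example.

```python
import pyparsing

embedTag = pyparsing.makeHTMLTags("embed")[0]

# An Empty() matches without consuming input; its parse action returns the location
loc = pyparsing.Empty().setParseAction(lambda s, locn, toks: locn)
target = loc("beforeEmbed") + embedTag + loc("afterEmbed")

html = 'xx<embed src="a.swf">yy'   # made-up sample input
match = target.searchString(html)[0]

start = match["beforeEmbed"]
end = match["afterEmbed"]
embed_source = html[start:end]     # the raw text of the opening <embed> tag
```

The two locations are plain integer offsets into the string that was searched.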

Paul McGuire 2009-11-20 04:02:28

The loc thing worked but I couldn't seem to get the Optional thing to work. Are you sure that code sample works?

ʞɔıu 2009-11-20 16:08:02

Well *that* example doesn't work, because the `<a>` tag *doesn't* immediately follow the `<embed>` tag. I didn't follow what you meant by *follow*. What do you mean by *follow*?

Paul McGuire 2009-11-20 21:21:43

In the example, the embed tag is followed by some stuff, shown by ellipses, a close-embed tag, a close-object tag, an empty BR tag, and *then* the A tag.

Paul McGuire 2009-11-20 21:23:15

Could I do something like embedTag + SkipTo(endEmbedTag) + Optional(endObjectTag + brTag + aTag)?

ʞɔıu 2009-11-21 22:13:15

`embedTag + pyparsing.SkipTo(endEmbedTag, include=True) + pyparsing.Optional(endObjectTag + brTag + aTag)` should work for *this specific case*. But I would not be surprised if your HTML had other tags in there in unpredictable places. If you want to match an `<embed>` tag that is followed by an `<a>` tag, this might be a little more robust: `embedTag + pyparsing.SkipTo(aTag, failOn=embedTag) + aTag | embedTag`. In this case, SkipTo advances directly to the next aTag, but fails if there is another embedTag found first. But I'm in Pure Speculation Land here, so you have to fill in the rest.
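To make the second suggestion concrete, here is a rough sketch run against made-up input with two embeds, only the first of which is followed by a link. The sample HTML and variable names are assumptions for illustration; untested against real-world pages.

```python
import pyparsing

embedTag = pyparsing.makeHTMLTags("embed")[0]
aTag = pyparsing.makeHTMLTags("a")[0]

# Try "<embed> ... <a>" first, giving up if another <embed> shows up before
# the <a>; otherwise fall back to matching the bare <embed> tag.
target = (embedTag + pyparsing.SkipTo(aTag, failOn=embedTag) + aTag) | embedTag

html = """
<object><embed src="one.swf"></embed></object><br /><a href="blah">blah</a>
<object><embed src="two.swf"></embed></object>
"""

# makeHTMLTags exposes tag attributes as named results, so src and href
# can be looked up by name; href is None when no <a> was matched.
pairs = [(m.get("src"), m.get("href")) for m in target.searchString(html)]
```

The parentheses are only for readability; `+` already binds tighter than `|`, so the expression groups the same way as the one-liner above.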

Paul McGuire 2009-11-21 23:33:23

Answer 3

A:

Wouldn't you prefer a plain regex? Or is it because parsing HTML that way is a bad habit? :D

re.findall("<object.*?</object>(?:<br /><a.*?</a>)?",a)

S.Mark 2009-11-20 15:21:13

everyone on SO now knows that parsing HTML with regex is a crime against Man; cite: http://stackoverflow.com/questions/1732348/regex-match-open-tags-except-xhtml-self-contained-tags/1732454#1732454

ʞɔıu 2009-11-20 15:35:18

:D I see, that's been my impression too in the 2 days since I joined.

S.Mark 2009-11-20 15:45:41

I'm actually not opposed to using regex per se, but I've used that approach in the past and I'm trying to learn a better way. I could do that but I would still need/want a parser to parse out the HTML attributes, etc, and I may end up using a hybrid approach using a little of both.

ʞɔıu 2009-11-20 16:06:38

Answer 4

A:

I was able to run your BeautifulSoup code and received no errors. I'm running BeautifulSoup 3.0.7a.

Please use BeautifulSoup 3.0.7a; 3.1.0.1 has bugs that prevent it from working at all in some cases (such as yours).

gibson 2009-11-20 19:48:25

Would have added this as a comment on the first question but I don't have enough rep.

gibson 2009-11-20 19:49:22

Find following tag with pyparsing